Ralph Castain wrote:
I probably wasn't clear - see below
On Nov 14, 2008, at 6:31 PM, Eugene Loh wrote:
Ralph Castain wrote:
I have two examples so far:
1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
node, 2ppn, with btl=openib,sm,self. The program started, but
segfaulted on the first MPI_Send. No warnings were printed.
Interesting. So far as I can tell, the actual memory consumption
(total number of allocations in the mmapped segment) for 2 local
processes should be a little more than half a Mbyte. The bulk of
that would be from fragments (chunks). There are
btl_sm_free_list_num=8 per process, each of
btl_sm_max_frag_size=32K. So, that's 8x2x32K=512Kbyte. Actually, a
little bit more. Anyhow, that accounts for most of the allocations,
I think. Maybe if you're sending a lot of data, more gets allocated
at MPI_Send time. I don't know.
While only < 1 Mbyte is needed, though, mpool_sm_min_size=128M will
still be mapped.
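(For reference, that arithmetic as a tiny standalone program - the
constants below are just the default values quoted above, hard-coded for
illustration rather than read from the MCA:)

  #include <stdio.h>

  int main(void)
  {
      const unsigned long free_list_num = 8;          /* btl_sm_free_list_num */
      const unsigned long max_frag_size = 32 * 1024;  /* btl_sm_max_frag_size */
      const unsigned long local_procs   = 2;          /* 2ppn on one node */

      /* initial fragment allocations in the mmapped segment */
      unsigned long bytes = free_list_num * local_procs * max_frag_size;
      printf("%lu bytes (~%lu Kbyte)\n", bytes, bytes / 1024);  /* 524288, ~512 */
      return 0;
  }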
Right - so then it sounds to me like this would fail (which it did),
since /tmp was fixed at 10M and the mpool would be much too large
given a minimum size of 128M. Right?
That makes sense to me.
My analysis of how little of the mapped segment will actually be used is
probably irrelevant.
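Just to make the size mismatch concrete, here is a minimal standalone
sketch (nothing OMPI-specific - the 128M constant and the /tmp path are
simply the numbers from this scenario) comparing the requested
backing-file size against the space actually available:

  #include <stdio.h>
  #include <sys/statvfs.h>

  int main(void)
  {
      const unsigned long long wanted = 128ULL * 1024 * 1024;  /* mpool_sm_min_size */
      struct statvfs vfs;

      if (statvfs("/tmp", &vfs) != 0) {
          perror("statvfs");
          return 1;
      }
      unsigned long long avail =
          (unsigned long long) vfs.f_frsize * vfs.f_bavail;
      printf("/tmp has %llu bytes free; backing file wants %llu\n",
             avail, wanted);
      if (avail < wanted)
          printf("not enough room for the sm backing file\n");
      return 0;
  }

With /tmp capped at 10M, the "not enough room" branch is the one you
would expect to hit.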
Here is what I think should happen:
*) The lowest ranking process on the node opens and ftruncates the
file. Since there isn't enough space, the ftruncate fails. This is in
mca_common_sm_mmap_init() in ompi/mca/common/sm/common_sm_mmap.c.
*) The value sm_inited==0 is broadcast from this process to all other
local processes.
*) Nobody tries to mmap the file.
*) On each local process, mca_common_sm_mmap_init() returns a NULL map
to mca_mpool_sm_init(). This, incidentally, is the function where the
size of the backing file is determined, bounded by those max/min parameters.
*) In turn, mca_mpool_sm_init() returns a NULL value.
*) Therefore, sm_btl_first_time_init() returns OMPI_ERROR.
*) Therefore, mca_btl_sm_add_procs() goes into "CLEANUP" and returns
OMPI_ERROR.
*) Therefore, mca_bml_r2_add_procs() gives up on this BTL and tries to
establish connections otherwise.
I'm a little unclear on what should happen next. But, to reiterate, all
local processes should fail and indicate to the BML that the sm BTL
isn't going to work for them.
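To illustrate where that first step should trip, here is a stripped-down
sketch of the create/ftruncate/mmap sequence - not the actual
common_sm_mmap.c code, and the path and size arguments are illustrative
only:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* The lowest-ranking local process creates and sizes the backing
   * file, then maps it. On failure it returns NULL, which is what
   * should propagate up through mca_mpool_sm_init() and the sm BTL. */
  void *create_backing_file(const char *path, size_t size)
  {
      int fd = open(path, O_CREAT | O_RDWR, 0600);
      if (fd < 0) {
          perror("open");
          return NULL;
      }
      /* This is where "not enough space" should surface. (Caveat: on
       * some filesystems ftruncate just creates a sparse file and
       * succeeds, and the shortage only shows up later when pages are
       * actually touched.) */
      if (ftruncate(fd, (off_t) size) != 0) {
          perror("ftruncate");
          close(fd);
          unlink(path);
          return NULL;
      }
      void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      close(fd);
      return (seg == MAP_FAILED) ? NULL : seg;
  }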
It doesn't make sense that this case would fail, but the next case
should run. Are you sure this is related to the SM backing file?
Sorry, let me take that back. It does make some sense that the first
case would fail. The possible exception is if the connections fail over
to another BTL (openib, I presume).
What's weird is that the second case runs.
2. again with a ramdisk, /tmp was reportedly set to 16MB
(unverified - some uncertainty; it could have been much larger).
OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
The program ran to completion without errors or warnings. I don't
know the communication pattern - it could be that no local comm was
performed, though that sounds doubtful.
This case -did- run successfully. However, what puzzled me is that it
seems like it shouldn't have run because the 128M minimum was still
much larger than the available 16M.
Right. Weird.
One point that was made on an earlier thread - I don't know if either
of these cases had a tmpfs file system. I will try to find out. My
guess is "no" based on what I have been told so far - i.e., in both
cases, I was told that /tmp's size was "fixed", but that might not be
technically accurate.
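If it helps with that check, something like the following (Linux-specific,
and just a suggestion - not anything OMPI provides) will report whether
/tmp is tmpfs and how big it is:

  #include <linux/magic.h>   /* TMPFS_MAGIC */
  #include <stdio.h>
  #include <sys/vfs.h>       /* statfs */

  int main(void)
  {
      struct statfs fs;
      if (statfs("/tmp", &fs) != 0) {
          perror("statfs");
          return 1;
      }
      printf("/tmp is %stmpfs, capacity %llu MB\n",
             fs.f_type == TMPFS_MAGIC ? "" : "not ",
             (unsigned long long) fs.f_bsize * fs.f_blocks / (1024 * 1024));
      return 0;
  }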
As to whether we are sure about this being an SM backing file issue:
no, we can't say with absolute certainty. However, I can offer two
points of validation:
1. the test that failed (#1) ran perfectly when we set btl=^sm
2. the test that failed (#1) ran perfectly again after we increased
/tmp to 512M
The test that did not fail (#2) has never failed for sm reasons as
far as we know. We have had IB problems on occasion, but we believe
that is unrelated to this issue.
My point here was simply that I have two cases, one that failed and
one that didn't, that seem to me to be very similar. I don't
understand the difference in behavior, and am concerned that users
will be surprised - and spend a lot of energy trying to figure out
what happened. The possibility Tim M raised about the tmpfs may
explain the difference (if #2 used tmpfs and #1 didn't), and I will
check that ASAP.
I share your surprise.
Incidentally, does the MPI program test the return value from MPI_Init?
Another thing I've wondered about is whether OMPI fails in MPI_Init() and
correctly indicates this to the user, but the user doesn't check the
MPI_Init() return value.
User: You were broken!
OMPI: Yes, I know! I TOLD you I was broken, but you didn't listen.
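In case it helps make that point concrete, the check I have in mind is
nothing more than this (generic MPI, nothing OMPI-specific):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char *argv[])
  {
      /* If startup fails badly enough that MPI_Init itself reports an
       * error, this return code is the application's chance to hear
       * about it. */
      int rc = MPI_Init(&argc, &argv);
      if (rc != MPI_SUCCESS) {
          fprintf(stderr, "MPI_Init failed with error code %d\n", rc);
          return EXIT_FAILURE;
      }

      /* ... the rest of the application ... */

      MPI_Finalize();
      return 0;
  }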