I probably wasn't clear - see below

On Nov 14, 2008, at 6:31 PM, Eugene Loh wrote:

Ralph Castain wrote:

I have two examples so far:

1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single node, 2ppn, with btl=openib,sm,self. The program started, but segfaulted on the first MPI_Send. No warnings were printed.

Interesting. So far as I can tell, the actual memory consumption (the total of the allocations in the mmapped segment) for 2 local processes should be a little more than half a Mbyte. The bulk of that would be from fragments (chunks): there are btl_sm_free_list_num=8 per process, each of size btl_sm_max_frag_size=32K, so that's 8 x 2 x 32K = 512 Kbyte. Actually, a little bit more. Anyhow, that accounts for most of the allocations, I think. Maybe if you're sending a lot of data, more gets allocated at MPI_Send time. I don't know.

However, while only < 1 Mbyte is actually needed, mpool_sm_min_size=128M will still be mapped.

Right - so it sounds to me like this case would be expected to fail (which it did), since /tmp was fixed at 10M and the mpool, with a minimum size of 128M, would be far too large to fit. Right?
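Just to put the arithmetic above in one place, here is a tiny standalone sketch using the values quoted in this thread (btl_sm_free_list_num=8, btl_sm_max_frag_size=32K, 2 local procs). Treat those numbers as assumptions about this particular configuration, not as something I have verified for every build:

/* Back-of-the-envelope estimate of the sm free-list footprint described
 * above.  Parameter values are the ones quoted in this thread; they are
 * assumptions for this configuration, not universal defaults. */
#include <stdio.h>

int main(void)
{
    const long btl_sm_free_list_num = 8;         /* initial frags per process */
    const long btl_sm_max_frag_size = 32 * 1024; /* 32 Kbyte per fragment */
    const long local_procs          = 2;         /* 2 ppn in case #1 */

    long frag_bytes = btl_sm_free_list_num * btl_sm_max_frag_size * local_procs;

    printf("initial fragment allocations: %ld bytes (~%ld Kbyte)\n",
           frag_bytes, frag_bytes / 1024);
    /* Prints 524288 bytes (~512 Kbyte) - "a little more than half a Mbyte"
     * once other bookkeeping is added - yet mpool_sm_min_size still maps
     * a 128M backing file. */
    return 0;
}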



It doesn't make sense that this case would fail but the next case would run. Are you sure this is related to the SM backing file?

2. again with a ramdisk, /tmp was reportedly set to 16MB (unverified - there is some uncertainty, and it could have been much larger). OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self. The program ran to completion without errors or warnings. I don't know the communication pattern - it could be that no local comm was performed, though that sounds doubtful.

This case -did- run successfully. However, what puzzled me is that it seems like it shouldn't have run because the 128M minimum was still much larger than the available 16M.
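One possible explanation for the difference - and this is a guess on my part, not a description of what the sm mpool actually does - is that ftruncate/mmap of a backing file larger than the filesystem can succeed, and the process only dies when it first touches pages the filesystem cannot back. A minimal standalone demo (not Open MPI code; the path and sizes are made up for illustration):

/* Standalone illustration (not Open MPI code) of why mapping a backing
 * file larger than the filesystem can appear to work: ftruncate() and
 * mmap() of 128M succeed even on a 10-16M tmpfs because no pages are
 * allocated until they are touched.  The process is killed (SIGBUS)
 * only when it writes to a page the filesystem cannot back - which for
 * the sm BTL would be around the first local MPI_Send. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char  *path = "/tmp/sm_backing_demo";        /* made-up path */
    const size_t size = 128UL * 1024 * 1024;           /* like mpool_sm_min_size */

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)size) != 0) {
        perror("create/ftruncate");                    /* usually still succeeds on tmpfs */
        return 1;
    }

    char *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("mapped %zu bytes; nothing is allocated until pages are touched\n", size);

    /* Touching every page forces real allocation; on a too-small /tmp this
     * is where the process dies with SIGBUS, with no warning printed. */
    memset(seg, 0, size);

    printf("touched all pages - /tmp was large enough after all\n");
    munmap(seg, size);
    close(fd);
    unlink(path);
    return 0;
}

If case #2 simply never touched much of the mapped region (little or no local comm), that would be consistent with it running to completion while #1 died at the first MPI_Send.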


One point that was made on an earlier thread - I don't know if either of these cases had a tmpfs file system. I will try to find out. My guess is "no" based on what I have been told so far - i.e., in both cases, I was told that /tmp's size was "fixed", but that might not be technically accurate.

As to whether we are sure about this being an SM backing file issue: no, we can't say with absolute certainty. However, I can offer two points of validation:

1. the test that failed (#1) ran perfectly when we set btl=^sm

2. the test that failed (#1) ran perfectly again after we increased /tmp to 512M

The test that did not fail (#2) has never failed for sm reasons as far as we know. We have had IB problems on occasion, but we believe that is unrelated to this issue.

My point here was simply that I have two cases, one that failed and one that didn't, that seem to me to be very similar. I don't understand the difference in behavior, and am concerned that users will be surprised - and spend a lot of energy trying to figure out what happened. The possibility Tim M raised about the tmpfs may explain the difference (if #2 used tmpfs and #1 didn't), and I will check that ASAP.
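If it would help, something along these lines is the sort of up-front check and warning I think users would want. This is just a sketch, assuming a statvfs on the directory holding the backing file; the real session-directory and parameter plumbing in OMPI is obviously more involved, and the names here are my own:

/* A minimal sketch of the kind of preflight check discussed here:
 * compare free space where the sm backing file will live against the
 * size we intend to map, and warn instead of letting the job die later.
 * The path, threshold, and function name are assumptions, not OMPI code. */
#include <stdio.h>
#include <sys/statvfs.h>

int check_backing_store(const char *session_dir, unsigned long long needed)
{
    struct statvfs vfs;

    if (statvfs(session_dir, &vfs) != 0) {
        perror("statvfs");
        return -1;                                /* cannot tell; caller decides */
    }

    unsigned long long avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;
    if (avail < needed) {
        fprintf(stderr,
                "WARNING: %s has only %llu bytes free but the sm backing "
                "file wants %llu; expect failures, or disable the sm btl\n",
                session_dir, avail, needed);
        return 1;
    }
    return 0;
}

int main(void)
{
    /* 128M mirrors the mpool_sm_min_size mentioned in this thread. */
    return check_backing_store("/tmp", 128ULL * 1024 * 1024) > 0;
}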

Hope that helps clarify - sorry for confusion.
Ralph



