Agreed.
On 11/14/08 9:56 AM, "Ralph Castain" <r...@lanl.gov> wrote:

> On Nov 14, 2008, at 7:41 AM, Richard Graham wrote:
>
>> Just a few comments:
>> - Not sure what sort of alternative memory approach is being considered.
>>   The current approach was selected for two reasons:
>>   - If something like anonymous memory is used, one can only inherit
>>     access to the shared memory region, so one process needs to set up
>>     the shared memory and then fork() the procs that will use it. To do
>>     this portably, it usually has to happen inside MPI_Init(), so up to
>>     that point only one process runs on each host. Also, unrelated procs
>>     can't access this memory, so it can't be used in the context of
>>     Fault Tolerance.
>>   - The approach used here is very efficient for small systems, so
>>     alternatives should be added to what is in place here, so we don't
>>     lose the performance potential on small SMPs, which still make up
>>     the vast majority of systems.
>
> I concur - however, note that the segv occurred on a 4ppn system, which I
> think we would all agree constitutes a small SMP. I believe that the
> alternative memory approach needs to be a separate component, but I also
> believe that we need to modify the existing component so it doesn't segv
> if adequate memory isn't found.
>
> Just my $0.02
>
>> Rich
>>
>> On 11/14/08 9:22 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>
>>> Ok. Should be pretty easy to test/simulate to figure out what's going
>>> on -- e.g., whether it's segv'ing in MPI_INIT or the first MPI_SEND.
>>>
>>> On Nov 14, 2008, at 9:19 AM, Ralph Castain wrote:
>>>
>>>> Until we do complete the switch, and for systems that do not support
>>>> the alternate type of shared memory (I believe it is only Linux?), I
>>>> truly believe we should do something nicer than segv.
>>>>
>>>> Just to clarify: I know the segv case was run with paffinity set, and
>>>> I believe both cases were. In the first case, I was told that the segv
>>>> hit when they did MPI_Send, but I did not personally verify that claim
>>>> - it could be that it hit during maffinity binding if, as you suggest,
>>>> we actually touch the pages at that time.
>>>>
>>>> Ralph
>>>>
>>>> On Nov 14, 2008, at 7:07 AM, Jeff Squyres wrote:
>>>>
>>>>> It's been a looooong time since I've looked at the sm code; Eugene
>>>>> has looked at it much more in-depth recently than I have. But I'm
>>>>> guessing we *haven't* checked this stuff to abort nicely in such
>>>>> error conditions. We might very well succeed in the mmap but then
>>>>> segv later when the memory isn't actually available. Perhaps we
>>>>> should try to touch every page first to ensure that it's actually
>>>>> there...? (I'm pretty sure we do this when using paffinity, to
>>>>> maffinity-bind memory to processors -- perhaps we're not doing it in
>>>>> the !paffinity case?)
>>>>>
>>>>> Additionally, another solution might well be what Tim has long
>>>>> advocated: switch to the other type of shared memory on systems that
>>>>> support auto-pruning it when all processes die, and/or have the orted
>>>>> kill it when all processes die. Then a) we're not dependent on the
>>>>> filesystem's free space, and b) we're not writing all the dirty pages
>>>>> to disk when the processes exit.
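For what it's worth, here is a minimal standalone sketch (not taken from the
OMPI code base) of the constraint Rich describes above: an anonymous
MAP_SHARED mapping is visible only to processes that inherit it across
fork(), so the region has to be created before the peer processes exist -
in practice inside MPI_Init() - and a process started later has no way to
attach to it.

/* Minimal sketch (not OMPI code): anonymous shared memory can only be
 * shared with descendants, so the region must exist before fork(). */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1u << 20;   /* 1 MB region, size chosen only for illustration */

    /* MAP_ANONYMOUS | MAP_SHARED: no backing file in /tmp, but also no
     * name that an unrelated process could use to attach later. */
    int *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (MAP_FAILED == region) {
        perror("mmap");
        return 1;
    }

    if (0 == fork()) {       /* child inherits the mapping */
        region[0] = 42;
        _exit(0);
    }
    wait(NULL);
    printf("parent sees %d\n", region[0]);   /* prints 42 */

    munmap(region, len);
    return 0;
}

A file-backed mmap (the current sm approach) or SysV/POSIX shared memory
avoids this limitation, since any local process that knows the file name or
key can attach.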
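And a rough sketch of the "touch every page" idea Jeff suggests (again not
the actual sm mpool code, and the function names are just for illustration):
fault in one byte per page right after mmap'ing the backing file, trapping
SIGBUS/SIGSEGV so that a too-small /tmp shows up as a clean error in
MPI_Init() rather than a crash on the first MPI_Send.

/* Rough sketch: verify that every page of an mmap'ed region is really
 * backed by memory, reporting failure instead of crashing later. */
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static sigjmp_buf touch_env;

static void touch_fail(int sig)
{
    (void) sig;
    siglongjmp(touch_env, 1);
}

/* Return 0 if every page in [base, base+len) can be faulted in, -1 if not. */
static int touch_pages(char *base, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    struct sigaction sa, old_bus, old_segv;
    int rc = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = touch_fail;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old_bus);
    sigaction(SIGSEGV, &sa, &old_segv);

    if (sigsetjmp(touch_env, 1)) {
        rc = -1;                           /* some page was not really backed */
    } else {
        for (size_t off = 0; off < len; off += (size_t) page) {
            base[off] = 0;                 /* force the page to be allocated */
        }
    }

    sigaction(SIGBUS, &old_bus, NULL);
    sigaction(SIGSEGV, &old_segv, NULL);
    return rc;
}

Something along these lines could run once per backing file during setup;
alternatively, calling posix_fallocate() on the file before mmap'ing it
reports ENOSPC up front on filesystems that support it.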
>>>>>
>>>>> On Nov 14, 2008, at 8:42 AM, Ralph Castain wrote:
>>>>>
>>>>>> Hi Eugene
>>>>>>
>>>>>> I too am interested - I think we need to do something about the sm
>>>>>> backing file situation, as machines with larger core counts are
>>>>>> slated to become more prevalent shortly.
>>>>>>
>>>>>> I appreciate your info on the sizes and controls. One other
>>>>>> question: what happens when there isn't enough memory to support all
>>>>>> this? Are we smart enough to detect this situation? Does the sm
>>>>>> subsystem quietly shut down? Warn and shut down? Segfault?
>>>>>>
>>>>>> I have two examples so far:
>>>>>>
>>>>>> 1. Using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
>>>>>>    node, 2ppn, with btl=openib,sm,self. The program started, but
>>>>>>    segfaulted on the first MPI_Send. No warnings were printed.
>>>>>>
>>>>>> 2. Again with a ramdisk, /tmp was reportedly set to 16MB (unverified
>>>>>>    - some uncertainty; it could have been much larger). OMPI was run
>>>>>>    on multiple nodes, 16ppn, with btl=openib,sm,self. The program
>>>>>>    ran to completion without errors or warnings. I don't know the
>>>>>>    communication pattern - it could be that no local comm was
>>>>>>    performed, though that sounds doubtful.
>>>>>>
>>>>>> If someone doesn't know, I'll have to dig into the code and figure
>>>>>> out the response - just hoping that someone can spare me the pain.
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>> On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
>>>>>>
>>>>>>> Ralph Castain wrote:
>>>>>>>
>>>>>>>> As has frequently been commented upon at one time or another, the
>>>>>>>> shared memory backing file can be quite huge. There used to be a
>>>>>>>> param for controlling this size, but I can't find it in 1.3 - or
>>>>>>>> at least, the name or method for controlling file size has morphed
>>>>>>>> into something I don't recognize.
>>>>>>>>
>>>>>>>> Can someone more familiar with that subsystem point me to one or
>>>>>>>> more params that will allow us to control the size of that file?
>>>>>>>> It is swamping our systems and causing OMPI to segfault.
>>>>>>>
>>>>>>> Sounds like you've already gotten your answers, but I'll add my
>>>>>>> $0.02 anyhow.
>>>>>>>
>>>>>>> The file size is the number of local processes (call it n) times
>>>>>>> mpool_sm_per_peer_size (default 32M), but with a minimum of
>>>>>>> mpool_sm_min_size (default 128M) and a maximum of mpool_sm_max_size
>>>>>>> (default 2G? 256M?). So, you can tweak those parameters to control
>>>>>>> file size.
>>>>>>>
>>>>>>> Another issue is possibly how small a backing file you can get away
>>>>>>> with. That is, just forcing the file to be smaller may not be
>>>>>>> enough, since your job may no longer run. The backing file seems to
>>>>>>> be used mainly by:
>>>>>>>
>>>>>>> *) eager-fragment free lists: We start with enough eager fragments
>>>>>>>    that we could have two per connection. So, you could bump the sm
>>>>>>>    eager size down if you need to shoehorn a job into a very small
>>>>>>>    backing file.
>>>>>>>
>>>>>>> *) large-fragment free lists: We start with 8*n large fragments.
>>>>>>>    If this term plagues you, you can bump the sm chunk size down
>>>>>>>    or reduce the value of 8 (using btl_sm_free_list_num, I think).
>>>>>>>
>>>>>>> *) FIFOs: The code tries to align a number of things on pagesize
>>>>>>>    boundaries, so you end up with about 3*n*n*pagesize overhead
>>>>>>>    here. If this term is causing you problems, you're stuck (unless
>>>>>>>    you modify OMPI).
>>>>>>>
>>>>>>> I'm interested in this subject! :^)
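To make Eugene's sizing arithmetic concrete, here is a small standalone
sketch (not OMPI code): the backing file is n * mpool_sm_per_peer_size,
clamped to [mpool_sm_min_size, mpool_sm_max_size], plus roughly
3*n*n*pagesize of FIFO overhead inside it. The 256M value for
mpool_sm_max_size and the 4 KB page size are assumptions on my part; check
the real defaults with ompi_info.

/* Back-of-the-envelope sketch (not OMPI code) of the sizing rule quoted
 * above.  256M for mpool_sm_max_size and a 4 KB page are assumptions. */
#include <stdio.h>

int main(void)
{
    const long long per_peer = 32LL  << 20;  /* mpool_sm_per_peer_size */
    const long long min_size = 128LL << 20;  /* mpool_sm_min_size */
    const long long max_size = 256LL << 20;  /* mpool_sm_max_size (assumed) */
    const long long pagesize = 4096;         /* assumed page size */

    for (int n = 2; n <= 16; n *= 2) {       /* n = local processes per node */
        long long file = n * per_peer;        /* n * per-peer size ...        */
        if (file < min_size) file = min_size; /* ... clamped to [min, max]    */
        if (file > max_size) file = max_size;

        /* FIFO overhead grows quadratically with the local process count. */
        long long fifos = 3LL * n * n * pagesize;

        printf("n=%2d: backing file ~%lld MB, of which FIFOs ~%lld KB\n",
               n, file >> 20, fifos >> 10);
    }
    return 0;
}

With those defaults even a 2ppn job asks for a 128 MB backing file, so a
10 MB ramdisk /tmp can only hold it as long as the file stays sparse, which
is consistent with the crash showing up on the first MPI_Send rather than
when the file is created.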