I was reminded this morning (by 2 people :-) ) that the sysv shmem stuff was initiated a long time ago as a workaround for many of these same issues (including the potential performance issues).
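For anyone who hasn't looked at that SysV path: a minimal sketch of the calls involved (simplified, with made-up sizes -- this is not the actual OMPI sysv component code) looks like this:

    /* Simplified illustration of the SysV shared memory calls -- not the
     * actual OMPI code.  The segment lives in kernel memory (no filesystem
     * backing), and marking it IPC_RMID once everyone has attached means the
     * kernel frees it when the last process detaches or exits.  A process
     * that dies between shmget() and the IPC_RMID call still leaves an
     * orphaned ID behind, which only external help (e.g. ipcrm) reclaims. */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        const size_t len = 4 * 1024 * 1024;   /* arbitrary 4 MB segment */

        int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
        if (id < 0) {
            perror("shmget");
            return 1;
        }

        void *seg = shmat(id, NULL, 0);       /* attach; other procs would be told the same id */
        if (seg == (void *)-1) {
            perror("shmat");
            return 1;
        }

        /* Once all participants have attached, mark the segment for removal:
         * the ID disappears from the system when the last attach goes away. */
        if (shmctl(id, IPC_RMID, NULL) != 0) {
            perror("shmctl(IPC_RMID)");
        }

        /* ... use the segment ... */

        shmdt(seg);                           /* last detach actually frees it */
        return 0;
    }

That "died before IPC_RMID" window is the part that still needs external help, regardless of how fast the segment itself is.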
Sam's work is nearly complete; I think that -- at least on Linux -- the mmap performance issues can go away. The cleanup issues will not go away; cleanup still requires external help to *guarantee* that shared memory IDs are removed after the job has completed.

On May 18, 2010, at 8:45 AM, Jeff Squyres (jsquyres) wrote:

> Ralph and I talked about this on the phone a bit this morning. There are
> several complicating factors in using /dev/shm (aren't there always? :-) ).
>
> 0. Note that anything in /dev/shm will need to have session-directory-like
> semantics: there need to be per-user and per-job characteristics (e.g., if
> the same user launches multiple jobs on the same node, etc.).
>
> 1. It is not necessarily a good idea to put the entire session directory in
> /dev/shm. It's not just the shared memory files that go in the session
> directory; there's a handful of other metadata files that go in there as
> well. Those files don't take up much space, but it still feels wrong to put
> anything other than shared memory files in there. Indeed, checkpoint files
> and filem files can go in there -- these can eat up lots of space (RAM).
>
> 2. /dev/shm may not be configured right, and/or there are possible /dev/shm
> configurations where you *do* use twice the memory (Ralph cited an example
> of a nameless organization that had exactly this problem -- we don't know if
> this was a misconfiguration or whether it was done on purpose for some
> reason). I don't know if kernel version comes into play here, too (e.g., if
> older Linux kernel versions did double the memory, or somesuch). So it's not
> necessarily a slam dunk that you *always* want to do this.
>
> 3. The session directory has "best effort" cleanup at the end of the job:
>
> - MPI jobs (effectively) rm -rf the session directory
> - The orted (effectively) rm -rf's the session directory
>
> But neither of these is *guaranteed* -- for example, if the resource manager
> kills the job with extreme prejudice, the session directory can be left
> around. Where possible, ORTE tries to put the session directory in a
> resource manager job-specific temp directory so that the resource manager
> itself whacks the session directory at the end of the job. But this isn't
> always the case.
>
> So the session directory has 2 levels of attempted cleanup (MPI procs and
> orted), and sometimes a 3rd (the resource manager).
>
> 3a. If the session directory is in /dev/shm, we get the 2 levels but
> definitely not the 3rd (note: I don't think that putting the session
> directory there is a good idea, per #1 -- I'm just being complete).
>
> 3b. If the shared memory files are outside the session directory, we don't
> get any of the additional cleanup without adding some additional
> infrastructure -- possibly into orte/util/session_dir.* (e.g., add /dev/shm
> as a secondary session directory root). This would allow us to effect
> session directory-like semantics inside /dev/shm.
>
> 4. But even with 2 levels of possible cleanup, not having the resource
> manager cleanup can be quite disastrous if shared memory is left around
> after a job is forcibly terminated. Sysadmins can do stuff like rm -rf
> /dev/shm (or whatever) between jobs to guarantee cleanup, but those are
> extra steps required outside of OMPI.
>
> --> This seems to imply that using /dev/shm should not be default behavior.
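To make points 3 and 4 above concrete, here is roughly what the per-process level of "best effort" cleanup amounts to for an mmap-backed file; the path and names below are made up for illustration and this is not the sm BTL's actual code:

    /* Minimal sketch of "best effort" cleanup of an mmap backing file.
     * Names and paths are hypothetical.  A process killed with SIGKILL never
     * reaches the atexit() handler, which is exactly why an external sweep
     * (resource manager epilogue, sysadmin cron job, etc.) is still needed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    static char backing_file[4096];          /* hypothetical path under /dev/shm */

    static void cleanup_backing_file(void)
    {
        unlink(backing_file);                /* best effort: skipped on SIGKILL */
    }

    int main(void)
    {
        const size_t len = 4 * 1024 * 1024;  /* 4 MB segment, arbitrary for the sketch */

        snprintf(backing_file, sizeof(backing_file),
                 "/dev/shm/example-shmem-uid%d-pid%d", (int)getuid(), (int)getpid());

        int fd = open(backing_file, O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, (off_t)len) != 0) {
            perror("create backing file");
            return 1;
        }

        void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        close(fd);                           /* mapping stays valid after close */

        atexit(cleanup_backing_file);        /* level-1 cleanup: the process itself */

        memset(seg, 0, len);                 /* ... normal use of the shared segment ... */
        munmap(seg, len);
        return 0;                            /* atexit runs here on a normal exit */
    }

The orted's rm -rf of the session directory is a second shot at the same unlink; only the resource manager (or a sysadmin sweep) covers the kill -9 case.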
>
> -----
>
> All this being said, it seems like 3b is a reasonable way to go forward:
> extend orte/util/session_dir.* to allow for multiple session directory roots
> (somehow -- exact mechanism TBD). Then both the MPI processes and the orted
> will try to clean up both the real session directory and /dev/shm. Both
> roots will use the same per user/per job kinds of characteristics that the
> session dir already has.
>
> Then we can extend the MCA param orte_tmpdir_base to accept a
> comma-delimited list of roots. It still defaults to /tmp, but a sysadmin
> can set it to be /tmp,/dev/shm (or whatever) if they want to use /dev/shm.
> OMPI will still do "best effort" cleanup of /dev/shm, but it's the
> sysadmin's responsibility to *guarantee* its cleanup after a job ends, etc.
>
> Thoughts?
>
> On May 18, 2010, at 4:09 AM, Sylvain Jeaugey wrote:
>
> > I would go further on this: when available, putting the session directory
> > in a tmpfs filesystem (e.g. /dev/shm) should give you the maximum
> > performance.
> >
> > Again, when using /dev/shm instead of the local /tmp filesystem, I get a
> > consistent 1-5us latency improvement on a barrier at 32 cores (on a single
> > node). So it may not be noticeable for everyone, but it seems faster in
> > all cases.
> >
> > Sylvain
> >
> > On Mon, 17 May 2010, Paul H. Hargrove wrote:
> >
> > > Entry looks good, but could probably use an additional sentence or two
> > > like:
> > >
> > > On diskless nodes running Linux, use of /dev/shm may be an option if
> > > supported by your distribution. This will use an in-memory file system
> > > for the session directory, but will NOT result in a doubling of the
> > > memory consumed for the shared memory file (i.e. file system "blocks"
> > > and memory "pages" share a single instance).
> > >
> > > -Paul
> > >
> > > Jeff Squyres wrote:
> > >> How's this?
> > >>
> > >> http://www.open-mpi.org/faq/?category=sm#poor-sm-btl-performance
> > >>
> > >> What's the advantage of /dev/shm? (I don't know anything about /dev/shm)
> > >>
> > >> On May 17, 2010, at 4:08 AM, Sylvain Jeaugey wrote:
> > >>
> > >>> I agree with Paul on the fact that a FAQ update would be great on this
> > >>> subject. /dev/shm seems a good place to put the temporary files (when
> > >>> available, of course).
> > >>>
> > >>> Putting files in /dev/shm also showed better performance on our
> > >>> systems, even with /tmp on a local disk.
> > >>>
> > >>> Sylvain
> > >>>
> > >>> On Sun, 16 May 2010, Paul H. Hargrove wrote:
> > >>>
> > >>>> If I google "ompi sm btl performance" the top match is
> > >>>> http://www.open-mpi.org/faq/?category=sm
> > >>>>
> > >>>> I scanned the entire page from top to bottom and don't see any
> > >>>> questions of the form
> > >>>>   Why is SM performance slower than ...?
> > >>>>
> > >>>> The words "NFS", "network", "file system" or "filesystem" appear
> > >>>> nowhere on the page. The closest I could find is
> > >>>>
> > >>>>> 7. Where is the file that sm will mmap in?
> > >>>>>
> > >>>>> The file will be in the OMPI session directory, which is typically
> > >>>>> something like /tmp/openmpi-sessions-myusername@mynodename* . The
> > >>>>> file itself will have the name shared_mem_pool.mynodename. For
> > >>>>> example, the full path could be
> > >>>>> /tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.
> > >>>>>
> > >>>>> To place the session directory in a non-default location, use the
> > >>>>> MCA parameter orte_tmpdir_base.
> > >>>>
> > >>>> which says nothing about where one should or should not place the
> > >>>> session directory.
> > >>>>
> > >>>> Not having read the entire FAQ from start to end, I will not
> > >>>> contradict Ralph's claim that the "your SM performance might suck if
> > >>>> you put the session directory on a remote filesystem" FAQ entry does
> > >>>> exist, but I will assert that I did not find it in the SM section of
> > >>>> the FAQ. I tried google on "ompi session directory" and "ompi
> > >>>> orte_tmpdir_base" and still didn't find whatever entry Ralph is
> > >>>> talking about. So, I think the average user with no clue about the
> > >>>> relationship between the SM BTL and the session directory would need
> > >>>> some help finding it. Therefore, I still feel an FAQ entry in the SM
> > >>>> category is warranted, even if it just references whatever entry
> > >>>> Ralph is referring to.
> > >>>>
> > >>>> -Paul
> > >>>>
> > >>>> Ralph Castain wrote:
> > >>>>> We have had a FAQ on this for a long time...problem is, nobody reads
> > >>>>> it :-/
> > >>>>>
> > >>>>> Glad you found the problem!
> > >>>>>
> > >>>>> On May 14, 2010, at 3:15 PM, Paul H. Hargrove wrote:
> > >>>>>
> > >>>>>> Oskar Enoksson wrote:
> > >>>>>>> Christopher Samuel wrote:
> > >>>>>>>> Subject: Re: [OMPI devel] Very poor performance with btl sm on
> > >>>>>>>> twin nehalem servers with Mellanox ConnectX installed
> > >>>>>>>>
> > >>>>>>>> On 13/05/10 20:56, Oskar Enoksson wrote:
> > >>>>>>>>> The problem is that I get very bad performance unless I
> > >>>>>>>>> explicitly exclude the "sm" btl and I can't figure out why.
> > >>>>>>>>
> > >>>>>>>> Recently someone reported issues which were traced back to
> > >>>>>>>> the fact that the files that sm uses for mmap() were in a
> > >>>>>>>> /tmp which was NFS mounted; changing the location where their
> > >>>>>>>> files were kept to another directory with the orte_tmpdir_base
> > >>>>>>>> MCA parameter fixed that issue for them.
> > >>>>>>>>
> > >>>>>>>> Could it be similar for yourself?
> > >>>>>>>>
> > >>>>>>>> cheers,
> > >>>>>>>> Chris
> > >>>>>>>
> > >>>>>>> That was exactly right; as you guessed, these are diskless nodes
> > >>>>>>> that mount the root filesystem over NFS.
> > >>>>>>>
> > >>>>>>> Setting orte_tmpdir_base to /dev/shm and btl_sm_num_fifos=9 and
> > >>>>>>> then running mpi_stress on eight cores measures speeds of
> > >>>>>>> 1650 MB/s for 1 MB messages and 1600 MB/s for 10 kB messages.
> > >>>>>>>
> > >>>>>>> Thanks!
> > >>>>>>> /Oskar
> > >>>>>>
> > >>>>>> Sounds like a new FAQ entry is warranted.
> > >>>>>>
> > >>>>>> -Paul
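A quick way to see the effect Sylvain and Oskar describe is a plain barrier timing loop -- nothing OMPI-specific, just standard MPI -- run once with the default session directory and once with it moved, e.g. "mpirun --mca orte_tmpdir_base /dev/shm -np 8 ./barrier_check". This is just a rough sketch for local measurement, not a proper benchmark:

    /* Rough single-node barrier latency check.  Compare the average with the
     * session directory in the default location vs. on a tmpfs (/dev/shm). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int warmup = 1000, iters = 10000;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < warmup; i++) {       /* let the sm BTL warm up */
            MPI_Barrier(MPI_COMM_WORLD);
        }

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Barrier(MPI_COMM_WORLD);
        }
        double t1 = MPI_Wtime();

        if (rank == 0) {
            printf("average barrier time: %.2f us\n", (t1 - t0) / iters * 1e6);
        }

        MPI_Finalize();
        return 0;
    }

On a single node, a difference on the order Sylvain reports (a few microseconds per barrier) should show up directly in the average.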
--
Jeff Squyres
jsquy...@cisco.com