I was reminded this morning (by 2 people :-) ) that the sysv shmem stuff was 
initiated a long time ago as a workaround for many of these same issues 
(including the potential performance issues).

Sam's work is nearly complete; I think that -- at least on Linux -- the mmap 
performance issues can go away.  The cleanup issues will not go away, though: 
external help is still required to *guarantee* that shared memory IDs are 
removed after the job has completed.
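
For the SysV case, that external help could look something like the following 
(purely an illustrative sketch, not something OMPI does for you): a resource 
manager epilogue can find and remove any segments still owned by the job's user 
with the standard ipcs/ipcrm tools.

    # remove all SysV shared memory segments owned by the job's user;
    # JOBUSER is whatever the resource manager passes to the epilogue
    for id in $(ipcs -m | awk -v u="$JOBUSER" '$3 == u {print $2}'); do
        ipcrm -m "$id"
    done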


On May 18, 2010, at 8:45 AM, Jeff Squyres (jsquyres) wrote:

> Ralph and I talked about this on the phone a bit this morning.  There's 
> several complicating factors in using /dev/shm (aren't there always? :-) ).
> 
> 0. Note that anything in /dev/shm will need to have session-directory-like 
> semantics: there need to be per-user and per-job characteristics (e.g., to 
> handle the same user launching multiple jobs on the same node, etc.).
> 
> 1. It is not necessarily a good idea to put the entire session directory in 
> /dev/shm.  It's not just the shared memory files that go in the session 
> directory; a handful of other meta data files go in there as well.  Those 
> files don't take up much space, but it still feels wrong to put anything 
> other than shared memory files in there.  Indeed, checkpoint files and filem 
> files can also go in there -- and those can eat up lots of space (i.e., RAM, 
> if the session directory lives in /dev/shm). 
> 
> 2. /dev/shm may not be configured correctly, and/or there are possible 
> /dev/shm configurations where you *do* use twice the memory (Ralph cited an 
> example of a nameless organization that had exactly this problem -- we don't 
> know if this was a misconfiguration or whether it was done on purpose for 
> some reason).  I don't know if kernel version comes into play here, too 
> (e.g., whether older Linux kernel versions doubled the memory, or some such).  
> So it's not necessarily a slam dunk that you *always* want to do this.
> 
> 3. The session directory has "best effort" cleanup at the end of the job:
> 
> - MPI jobs (effectively) rm -rf the session directory
> - The orted (effectively) rm -rf's the session directory
> 
> But neither of these is *guaranteed* -- for example, if the resource manager 
> kills the job with extreme prejudice, the session directory can be left 
> around.  Where possible, ORTE tries to put the session directory in a 
> resource manager job-specific-temp directory so that the resource manager 
> itself whacks the session directory at the end of the job.  But this isn't 
> always the case.
> 
> So the session directory has 2 levels of attempted cleanup (MPI procs and 
> orted), and sometimes a 3rd (the resource manager).
> 
> 3a. If the session directory is in /dev/shm, we get the 2 levels but 
> definitely not the 3rd (note: per #1, I don't think that putting the entire 
> session directory in /dev/shm is a good idea -- I'm just being complete).
> 
> 3b. If the shared memory files are outside the session directory, we don't 
> get any of the additional cleanup without adding some additional 
> infrastructure -- possibly into orte/util/session_dir.* (e.g., add /dev/shm 
> as a secondary session directory root).  This would allow us to effect 
> session directory-like semantics inside /dev/shm.
> 
> 4. But even with 2 levels of possible cleanup, not having the resource 
> manager cleanup can be quite disastrous if shared memory is left around after 
> a job is forcibly terminated.  Sysadmins can do stuff like rm -rf /dev/shm 
> (or whatever) between jobs to guarantee cleanup (see the epilogue sketch 
> below), but those are extra steps required outside of OMPI. 
> 
> --> This seems to imply that using /dev/shm should not be default behavior.
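> 
> To make that "external help" concrete: a resource manager epilogue could do 
> something like the following.  This is purely a hypothetical sketch (JOBUSER 
> is whatever the resource manager hands the epilogue, and the directory-name 
> pattern follows the FAQ quote further down in this thread), not something 
> OMPI provides:
> 
>     #!/bin/sh
>     # Best-effort removal of any OMPI session directories the job left
>     # behind, in both the normal tmpdir and /dev/shm (if it was used).
>     rm -rf /tmp/openmpi-sessions-${JOBUSER}@$(hostname)*
>     rm -rf /dev/shm/openmpi-sessions-${JOBUSER}@$(hostname)*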
> 
> -----
> 
> All this being said, it seems like 3b is a reasonable way to go forward: 
> extend orte/util/session_dir.* to allow for multiple session directory roots 
> (somehow -- exact mechanism TBD).  Then both the MPI processes and the orted 
> will try to clean up both the real session directory and /dev/shm.  Both 
> roots will use the same per-user/per-job kinds of characteristics that the 
> session dir already has. 
> 
> Then we can extend the MCA param orte_tmpdir_base to accept a comma-delimited 
> list of roots.  It still defaults to /tmp, but a sysadmin can set it to be 
> /tmp,/dev/shm (or whatever) if they want to use /dev/shm.  OMPI will still do 
> "best effort" cleanup of /dev/shm, but it's the sysadmin's responsibility to 
> *guarantee* its cleanup after a job ends, etc.
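> 
> A sketch of what that might look like, assuming the proposed comma-delimited 
> syntax (which does not exist yet) were adopted:
> 
>     # system-wide default, e.g. in the MCA params file:
>     orte_tmpdir_base = /tmp,/dev/shm
> 
>     # or per-run on the mpirun command line:
>     mpirun --mca orte_tmpdir_base /tmp,/dev/shm -np 8 ./a.out
> 
> The shared memory backing files would then presumably land under the /dev/shm 
> root, with the rest of the session directory contents staying in /tmp, per #1 
> and 3b above.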
> 
> Thoughts?
> 
> 
> On May 18, 2010, at 4:09 AM, Sylvain Jeaugey wrote:
> 
> > I would go further on this: when available, putting the session directory
> > in a tmpfs filesystem (e.g. /dev/shm) should give you the maximum
> > performance.
> >
> > Again, when using /dev/shm instead of the local /tmp filesystem, I get a
> > consistent 1-5us latency improvement on a barrier at 32 cores (on a single
> > node). So it may not be noticeable for everyone, but it seems faster in
> > all cases.
> >
> > Sylvain
> >
> > On Mon, 17 May 2010, Paul H. Hargrove wrote:
> >
> > > Entry looks good, but could probably use an additional sentence or two like:
> > >
> > > On diskless nodes running Linux, use of /dev/shm may be an option if
> > > supported by your distribution.  This will use an in-memory file system
> > > for the session directory, but will NOT result in a doubling of the memory
> > > consumed for the shared memory file (i.e. file system "blocks" and memory
> > > "pages" share a single instance).
> > >
> > > -Paul
> > >
> > > Jeff Squyres wrote:
> > >> How's this?
> > >>
> > >>     http://www.open-mpi.org/faq/?category=sm#poor-sm-btl-performance
> > >>
> > >> What's the advantage of /dev/shm?  (I don't know anything about /dev/shm)
> > >>
> > >>
> > >> On May 17, 2010, at 4:08 AM, Sylvain Jeaugey wrote:
> > >>
> > >>
> > >>> I agree with Paul on the fact that a FAQ update would be great on this
> > >>> subject. /dev/shm seems a good place to put the temporary files (when
> > >>> available, of course).
> > >>>
> > >>> Putting files in /dev/shm also showed better performance on our systems,
> > >>> even with /tmp on a local disk.
> > >>>
> > >>> Sylvain
> > >>>
> > >>> On Sun, 16 May 2010, Paul H. Hargrove wrote:
> > >>>
> > >>>
> > >>>> If I google "ompi sm btl performance" the top match is
> > >>>>  http://www.open-mpi.org/faq/?category=sm
> > >>>>
> > >>>> I scanned the entire page from top to bottom and don't see any
> > >>>> questions of the form
> > >>>>   Why is SM performance slower than ...?
> > >>>>
> > >>>> The words "NFS", "network", "file system" or "filesystem" appear
> > >>>> nowhere on the page.  The closest I could find is
> > >>>>
> > >>>>> 7. Where is the file that sm will mmap in?
> > >>>>>
> > >>>>> The file will be in the OMPI session directory, which is typically
> > >>>>> something like /tmp/openmpi-sessions-myusername@mynodename* . The file
> > >>>>> itself will have the name shared_mem_pool.mynodename. For example, the
> > >>>>> full path could be
> > >>>>> /tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.
> > >>>>>
> > >>>>> To place the session directory in a non-default location, use the MCA
> > >>>>> parameter orte_tmpdir_base.
> > >>>>>
> > >>>> which says nothing about where one should or should not place the
> > >>>> session directory.
> > >>>>
> > >>>> Not having read the entire FAQ from start to end, I will not contradict
> > >>>> Ralph's claim that the "your SM performance might suck if you put the
> > >>>> session directory on a remote filesystem" FAQ entry does exist, but I
> > >>>> will assert that I did not find it in the SM section of the FAQ.  I
> > >>>> tried google on "ompi session directory" and "ompi orte_tmpdir_base"
> > >>>> and still didn't find whatever entry Ralph is talking about.  So, I
> > >>>> think the average user with no clue about the relationship between the
> > >>>> SM BTL and the session directory would need some help finding it.
> > >>>> Therefore, I still feel an FAQ entry in the SM category is warranted,
> > >>>> even if it just references whatever entry Ralph is referring to.
> > >>>>
> > >>>> -Paul
> > >>>>
> > >>>> Ralph Castain wrote:
> > >>>>
> > >>>>> We have had a FAQ on this for a long time... problem is, nobody reads
> > >>>>> it :-/
> > >>>>>
> > >>>>> Glad you found the problem!
> > >>>>>
> > >>>>> On May 14, 2010, at 3:15 PM, Paul H. Hargrove wrote:
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>> Oskar Enoksson wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>>> Christopher Samuel wrote:
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> Subject: Re: [OMPI devel] Very poor performance with btl sm on twin
> > >>>>>>>>   nehalem servers with Mellanox ConnectX installed
> > >>>>>>>>
> > >>>>>>>> On 13/05/10 20:56, Oskar Enoksson wrote:
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> The problem is that I get very bad performance unless I
> > >>>>>>>>> explicitly exclude the "sm" btl and I can't figure out why.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>> Recently someone reported issues which were traced back to
> > >>>>>>>> the fact that the files that sm uses for mmap() were in a
> > >>>>>>>> /tmp which was NFS mounted; changing the location where their
> > >>>>>>>> files were kept to another directory with the orte_tmpdir_base
> > >>>>>>>> MCA parameter fixed that issue for them.
> > >>>>>>>>
> > >>>>>>>> Could it be similar for yourself ?
> > >>>>>>>>
> > >>>>>>>> cheers,
> > >>>>>>>> Chris
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>> That was exactly right; as you guessed, these are diskless nodes
> > >>>>>>> that mount the root filesystem over NFS.
> > >>>>>>>
> > >>>>>>> Setting orte_tmpdir_base to /dev/shm and btl_sm_num_fifos=9 and then
> > >>>>>>> running mpi_stress on eight cores measures speeds of 1650MB/s for
> > >>>>>>> 1MB messages and 1600MB/s for 10kB messages.
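> > >>>>>>>
> > >>>>>>> For anyone else hitting this, a command line along these lines
> > >>>>>>> should reproduce that configuration (the mpi_stress invocation here
> > >>>>>>> is just a placeholder for whatever arguments you normally use):
> > >>>>>>>
> > >>>>>>>     mpirun -np 8 --mca orte_tmpdir_base /dev/shm \
> > >>>>>>>            --mca btl_sm_num_fifos 9 ./mpi_stress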
> > >>>>>>>
> > >>>>>>> Thanks!
> > >>>>>>> /Oskar
> > >>>>>>>
> > >>>>>> Sounds like a new FAQ entry is warranted.
> > >>>>>>
> > >>>>>> -Paul
> > >>>>>>
> > >>>>>>
> > >>>>
> > > --
> > > Paul H. Hargrove                          phhargr...@lbl.gov
> > > Future Technologies Group                 Tel: +1-510-495-2352
> > > HPC Research Department                   Fax: +1-510-486-6900
> > > Lawrence Berkeley National Laboratory
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

