Ralph and I talked about this on the phone a bit this morning.  There are 
several complicating factors in using /dev/shm (aren't there always? :-) ).

0. Note that anything in /dev/shm will need session-directory-like semantics: 
it needs per-user and per-job characteristics (e.g., to handle the case where 
the same user launches multiple jobs on the same node).
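
For illustration only (borrowing the naming scheme from the FAQ entry quoted 
further down in this thread), I mean a layout something like:

    /dev/shm/openmpi-sessions-myusername@mynodename_0/<jobid>/<vpid>/...

i.e., the same per-user / per-job nesting that the regular session directory 
already uses under /tmp.  The <jobid>/<vpid> parts are placeholders, not the 
exact naming.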

1. It is not necessarily a good idea to put the entire session directory in 
/dev/shm.  It's not just the shared memory files that go in the session 
directory; a handful of other metadata files go in there as well.  Those files 
don't take up much space, but it still feels wrong to put anything other than 
shared memory files in there.  Worse, checkpoint files and filem files can 
also end up in the session directory -- and those can eat up lots of space 
(which, in /dev/shm, means RAM).

2. /dev/shm may not be configured correctly, and there are possible /dev/shm 
configurations where you *do* use twice the memory (Ralph cited an example of 
a nameless organization that had exactly this problem -- we don't know whether 
this was a misconfiguration or whether it was done on purpose for some 
reason).  I don't know if the kernel version comes into play here, too (e.g., 
whether older Linux kernel versions doubled the memory, or somesuch).  So it's 
not necessarily a slam dunk that you *always* want to do this.
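
That said, detecting a /dev/shm that is simply mounted too small is easy 
enough.  Here's a rough standalone sketch -- this is not existing OMPI code, 
and the shm_mount_big_enough name and the 256 MB threshold are made up purely 
for illustration.  Note that it cannot detect the double-memory problem, only 
an undersized or missing mount:

    #include <stdio.h>
    #include <sys/statvfs.h>

    /* Rough check of how much space a tmpfs mount like /dev/shm offers.
     * This cannot tell whether the configuration double-counts memory;
     * it only tells us whether the mount looks big enough to bother with. */
    static int shm_mount_big_enough(const char *path,
                                    unsigned long long required_bytes)
    {
        struct statvfs vfs;

        if (statvfs(path, &vfs) != 0) {
            return 0;                /* can't stat it: don't use it */
        }

        /* available bytes = available blocks * fragment size */
        unsigned long long avail =
            (unsigned long long) vfs.f_bavail * vfs.f_frsize;
        return avail >= required_bytes;
    }

    int main(void)
    {
        /* e.g., require at least 256 MB free before considering /dev/shm */
        if (shm_mount_big_enough("/dev/shm", 256ULL * 1024 * 1024)) {
            printf("/dev/shm looks usable\n");
        } else {
            printf("/dev/shm missing or too small; fall back to /tmp\n");
        }
        return 0;
    }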

3. The session directory has "best effort" cleanup at the end of the job:

- MPI jobs (effectively) rm -rf the session directory
- The orted (effectively) rm -rf's the session directory

But neither of these is *guaranteed* -- for example, if the resource manager 
kills the job with extreme prejudice, the session directory can be left 
behind.  Where possible, ORTE tries to put the session directory in a resource 
manager job-specific temp directory so that the resource manager itself whacks 
the session directory at the end of the job.  But this isn't always the case.

So the session directory has 2 levels of attempted cleanup (MPI procs and 
orted), and sometimes a 3rd (the resource manager).

3a. If the session directory is in /dev/shm, we get the 2 levels but 
definitely not the 3rd (note: per #1, I don't think that putting the entire 
session directory in /dev/shm is a good idea -- I'm just being complete).

3b. If the shared memory files are outside the session directory, we don't get 
any of that cleanup without adding some additional infrastructure -- possibly 
into orte/util/session_dir.* (e.g., add /dev/shm as a secondary session 
directory root).  This would allow us to effect session-directory-like 
semantics inside /dev/shm.
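
To make "session-directory-like semantics inside /dev/shm" concrete on the 
cleanup side, here's a rough standalone sketch.  This is *not* actual 
orte/util/session_dir.* code; the root paths, the session_dir_roots list, and 
cleanup_all_roots are hypothetical placeholders:

    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>

    /* Hypothetical list of per-job session directory roots; the real
     * mechanism (MCA param parsing, per-user/per-job naming) is TBD. */
    static const char *session_dir_roots[] = {
        "/tmp/openmpi-sessions-myusername@mynodename_0/1234",
        "/dev/shm/openmpi-sessions-myusername@mynodename_0/1234",
        NULL
    };

    /* nftw() callback: remove each entry after its children (FTW_DEPTH).
     * Always return 0 so the walk keeps going -- this is "best effort". */
    static int rm_entry(const char *path, const struct stat *sb,
                        int typeflag, struct FTW *ftwbuf)
    {
        (void) sb; (void) typeflag; (void) ftwbuf;
        (void) remove(path);
        return 0;
    }

    /* "Best effort" cleanup: try every root, ignore failures. */
    static void cleanup_all_roots(void)
    {
        for (int i = 0; NULL != session_dir_roots[i]; ++i) {
            (void) nftw(session_dir_roots[i], rm_entry, 16,
                        FTW_DEPTH | FTW_PHYS);
        }
    }

    int main(void)
    {
        cleanup_all_roots();
        return 0;
    }

The point is just that once /dev/shm is treated as another session directory 
root with the same naming scheme, the two existing cleanup paths (MPI procs 
and the orted) cover it with no extra logic.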

4. But even with 2 levels of possible cleanup, not having the resource manager 
cleanup can be quite disastrous if shared memory is left around after a job is 
forcibly terminated.  Sysadmins can do stuff like clean out /dev/shm (e.g., 
rm -rf /dev/shm/*) between jobs to guarantee cleanup, but those are extra 
steps required outside of OMPI.

--> This seems to imply that using /dev/shm should not be default behavior.

-----

All this being said, it seems like 3b is a reasonable way to go forward: extend 
orte/util/session_dir.* to allow for multiple session directory roots (somehow 
-- exact mechanism TBD).  Then both the MPI processes and the orted will try to 
clean up both the real session directory and /dev/shm.  Both roots will use 
the same per-user/per-job characteristics that the session dir already has.

Then we can extend the MCA param orte_tmpdir_base to accept a comma-delimited 
list of roots.  It still defaults to /tmp, but a sysadmin can set it to be 
/tmp,/dev/shm (or whatever) if they want to use /dev/shm.  OMPI will still do 
"best effort" cleanup of /dev/shm, but it's the sysadmin's responsibility to 
*guarantee* its cleanup after a job ends, etc.
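
The comma-delimited parsing itself is trivial; a standalone sketch in plain C 
(deliberately not using OMPI's own MCA/argv utilities, and the 
setup_session_dir_roots function name is made up) might look like:

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Split a comma-delimited orte_tmpdir_base value such as "/tmp,/dev/shm"
     * and hand each root to the (hypothetical) per-root setup code. */
    static void setup_session_dir_roots(const char *tmpdir_base)
    {
        char *copy = strdup(tmpdir_base);
        if (NULL == copy) {
            return;
        }

        char *saveptr = NULL;
        for (char *root = strtok_r(copy, ",", &saveptr);
             NULL != root;
             root = strtok_r(NULL, ",", &saveptr)) {
            /* A real implementation would create the per-user/per-job
             * subtree under each root; here we just print it. */
            printf("would create session directory tree under %s\n", root);
        }

        free(copy);
    }

    int main(void)
    {
        setup_session_dir_roots("/tmp,/dev/shm");
        return 0;
    }

(In the real code this would presumably go through the MCA parameter and argv 
utilities rather than raw strtok_r, but the behavior is the same: each root 
gets the identical per-user/per-job tree.)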

Thoughts?


On May 18, 2010, at 4:09 AM, Sylvain Jeaugey wrote:

> I would go further on this: when available, putting the session directory
> in a tmpfs filesystem (e.g. /dev/shm) should give you the maximum
> performance.
> 
> Again, when using /dev/shm instead of the local /tmp filesystem, I get a
> consistent 1-5us latency improvement on a barrier at 32 cores (on a single
> node). So it may not be noticeable for everyone, but it seems faster in
> all cases.
> 
> Sylvain
> 
> On Mon, 17 May 2010, Paul H. Hargrove wrote:
> 
> > Entry looks good, but could probably use an additional sentence or two like:
> >
> > On diskless nodes running Linux, use of /dev/shm may be an option if
> > supported by your distribution.  This will use an in-memory file system for
> > the session directory, but will NOT result in a doubling of the memory
> > consumed for the shared memory file (i.e. file system "blocks" and memory
> > "pages" share a single instance).
> >
> > -Paul
> >
> > Jeff Squyres wrote:
> >> How's this?
> >>
> >>     http://www.open-mpi.org/faq/?category=sm#poor-sm-btl-performance
> >>
> >> What's the advantage of /dev/shm?  (I don't know anything about /dev/shm)
> >>
> >>
> >> On May 17, 2010, at 4:08 AM, Sylvain Jeaugey wrote:
> >>
> >>
> >>> I agree with Paul on the fact that a FAQ update would be great on this
> >>> subject. /dev/shm seems a good place to put the temporary files (when
> >>> available, of course).
> >>>
> >>> Putting files in /dev/shm also showed better performance on our systems,
> >>> even with /tmp on a local disk.
> >>>
> >>> Sylvain
> >>>
> >>> On Sun, 16 May 2010, Paul H. Hargrove wrote:
> >>>
> >>>
> >>>> If I google "ompi sm btl performance" the top match is
> >>>>  http://www.open-mpi.org/faq/?category=sm
> >>>>
> >>>> I scanned the entire page from top to bottom and don't see any
> >>>> questions of the form
> >>>>   Why is SM performance slower than ...?
> >>>>
> >>>> The words "NFS", "network", "file system" or "filesystem" appear
> >>>> nowhere on the page.  The closest I could find is
> >>>>
> >>>>> 7. Where is the file that sm will mmap in?
> >>>>>
> >>>>> The file will be in the OMPI session directory, which is typically
> >>>>> something like /tmp/openmpi-sessions-myusername@mynodename* . The file
> >>>>> itself will have the name shared_mem_pool.mynodename.  For example,
> >>>>> the full path could be
> >>>>> /tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.
> >>>>>
> >>>>> To place the session directory in a non-default location, use the MCA
> >>>>> parameter orte_tmpdir_base.
> >>>>>
> >>>> which says nothing about where one should or should not place the session
> >>>> directory.
> >>>>
> >>>> Not having read the entire FAQ from start to end, I will not
> >>>> contradict Ralph's claim that the "your SM performance might suck if
> >>>> you put the session directory on a remote filesystem" FAQ entry does
> >>>> exist, but I will assert that I did not find it in the SM section of
> >>>> the FAQ.  I tried google on "ompi session directory" and "ompi
> >>>> orte_tmpdir_base" and still didn't find whatever entry Ralph is
> >>>> talking about.  So, I think the average user with no clue about the
> >>>> relationship between the SM BTL and the session directory would need
> >>>> some help finding it.  Therefore, I still feel an FAQ entry in the SM
> >>>> category is warranted, even if it just references whatever entry
> >>>> Ralph is referring to.
> >>>>
> >>>> -Paul
> >>>>
> >>>> Ralph Castain wrote:
> >>>>
> >>>>> We have had a FAQ on this for a long time...problem is, nobody reads it
> >>>>> :-/
> >>>>>
> >>>>> Glad you found the problem!
> >>>>>
> >>>>> On May 14, 2010, at 3:15 PM, Paul H. Hargrove wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Oskar Enoksson wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Christopher Samuel wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> Subject: Re: [OMPI devel] Very poor performance with btl sm on twin
> >>>>>>>>   nehalem servers with Mellanox ConnectX installed
> >>>>>>>>
> >>>>>>>> On 13/05/10 20:56, Oskar Enoksson wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> The problem is that I get very bad performance unless I
> >>>>>>>>> explicitly exclude the "sm" btl and I can't figure out why.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>> Recently someone reported issues which were traced back to
> >>>>>>>> the fact that the files that sm uses for mmap() were in a
> >>>>>>>> /tmp which was NFS mounted; changing the location where their
> >>>>>>>> files were kept to another directory with the orte_tmpdir_base
> >>>>>>>> MCA parameter fixed that issue for them.
> >>>>>>>>
> >>>>>>>> Could it be similar for yourself ?
> >>>>>>>>
> >>>>>>>> cheers,
> >>>>>>>> Chris
> >>>>>>>>
> >>>>>>>>
> >>>>>>> That was exactly right; as you guessed, these are diskless nodes
> >>>>>>> that mount the root filesystem over NFS.
> >>>>>>>
> >>>>>>> Setting orte_tmpdir_base to /dev/shm and btl_sm_num_fifos=9 and then
> >>>>>>> running mpi_stress on eight cores measures speeds of 1650MB/s for
> >>>>>>> 1MB messages and 1600MB/s for 10kB messages.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>> /Oskar
> >>>>>>>
> >>>>>>>
> >>>>>> Sounds like a new FAQ entry is warranted.
> >>>>>>
> >>>>>> -Paul
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >>
> >
> >
> >
> >
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

