Agreed.

On 11/14/08 9:56 AM, "Ralph Castain" <r...@lanl.gov> wrote:

> 
> On Nov 14, 2008, at 7:41 AM, Richard Graham wrote:
> 
>>  Just a few comments:
>>    - not sure what sort of alternative memory approach is being considered.
>> The current approach was selected for two reasons:
>>      - If something like anonymous memory is being used, access can only be
>> inherited, so one process needs to set up the shared memory regions and then
>> fork() the procs that will use them.  To do this portably, that usually has
>> to happen inside of MPI_Init(), so up to that stage only one process runs on
>> each host.  Also, unrelated procs can't access this memory, so it can't be
>> used in the context of Fault Tolerance.
>>    - The approach used here is very efficient for small systems, so
>> alternatives should be added to what is in place here, so we don't lose the
>> performance potential on small SMPs, which still describe the vast majority
>> of systems.
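
A minimal, hypothetical sketch of the constraint Rich describes in his first
point above: an anonymous shared mapping is only visible to descendants, so it
has to be created before fork(), and an unrelated (e.g. restarted) process has
no name to attach to.  MAP_ANONYMOUS is a Linux/BSD flag; this is illustration
only, not the OMPI code path.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;

    /* Must happen before fork(): only descendants of this process can see
     * the mapping; there is no name for another process to attach to. */
    char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (seg == MAP_FAILED)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {                      /* child inherits the same pages */
        strcpy(seg, "hello from the child");
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", seg);    /* prints the child's message */
    return 0;
}
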
> 
> I concur - however, note that the segv occurred on a 4ppn system, which I
> think we would all agree constitutes a small SMP. I believe that the
> alternative memory approach needs to be a separate component, but I also
> believe that we need to modify the existing component so it doesn't segv if
> adequate memory isn't found.
> 
> Just my $0.02
> 
>> 
>>  
>>  Rich
>>  
>>  
>>  On 11/14/08 9:22 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>  
>>  
>>> Ok.  Should be pretty easy to test/simulate to figure out what's going
>>>  on -- e.g., whether it's segv'ing in MPI_INIT or the first MPI_SEND.
>>>  
>>>  
>>>  On Nov 14, 2008, at 9:19 AM, Ralph Castain wrote:
>>>  
>>>>  > Until we complete the switch, and for systems that do not support
>>>>  > the alternate type of shared memory (I believe it is only Linux?), I
>>>>  > truly believe we should do something nicer than segv.
>>>>  >
>>>>  > Just to clarify: I know the segv case was done with paffinity set,
>>>>  > and believe both cases were done that way. In the first case, I was
>>>>  > told that the segv hit when they did MPI_Send, but I did not
>>>>  > personally verify that claim - it could be that it hit during
>>>>  > maffinity binding if, as you suggest, we actually touch the page at
>>>>  > that time.
>>>>  >
>>>>  > Ralph
>>>>  >
>>>>  >
>>>>  >
>>>>  > On Nov 14, 2008, at 7:07 AM, Jeff Squyres wrote:
>>>>  >
>>>>>  >> It's been a looooong time since I've looked at the sm code; Eugene
>>>>>  >> has looked at it much more in-depth recently than I have.  But I'm
>>>>>  >> guessing we *haven't* checked this stuff to abort nicely in such
>>>>>  >> error conditions.  We might very well succeed in the mmap but then
>>>>>  >> segv later when the memory isn't actually available.  Perhaps we
>>>>>  >> should try to touch every page first to ensure that it's actually
>>>>>  >> there...?  (I'm pretty sure we do this when using paffinity to
>>>>>  >> ensure to maffinity bind memory to processors -- perhaps we're not
>>>>>  >> doing that in the !paffinity case?)
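
For what it's worth, a rough sketch of the "touch every page first" idea Jeff
mentions above: walk the freshly mmap()ed backing file under a SIGBUS handler
so a too-small /tmp is reported up front instead of segfaulting later in
MPI_Send.  The function name is made up; this is not the existing mpool code.

#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

static sigjmp_buf touch_env;

static void sigbus_handler(int sig)
{
    (void)sig;
    siglongjmp(touch_env, 1);
}

/* Returns 0 if every page of the freshly created mapping is really backed by
 * storage, -1 if touching a page raised SIGBUS (e.g. the tmpfs ran out of
 * space).  Intended to run right after the file is created, so zero-filling
 * the pages is harmless. */
int sm_touch_pages(volatile char *base, size_t len)
{
    struct sigaction sa, old;
    long page = sysconf(_SC_PAGESIZE);
    int rc = 0;

    sa.sa_handler = sigbus_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGBUS, &sa, &old);

    if (sigsetjmp(touch_env, 1) == 0) {
        for (size_t off = 0; off < len; off += (size_t)page) {
            base[off] = 0;          /* write fault forces real allocation */
        }
    } else {
        rc = -1;                    /* SIGBUS: backing space was not there */
    }

    sigaction(SIGBUS, &old, NULL);
    return rc;
}
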
>>>>>  >>
>>>>>  >> Additionally, another solution might well be what Tim has long
>>>>>  >> advocated: switch to the other type of shared memory on systems
>>>>>  >> that support auto-pruning it when all processes die, and/or have
>>>>>  >> the orted kill it when all processes die.  Then a) we're not
>>>>>  >> dependent on the filesystem free space, and b) we're not writing
>>>>>  >> all the dirty pages to disk when the processes exit.
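
I read "the other type of shared memory" as something along the lines of POSIX
shm_open(): on Linux the segment lives in /dev/shm rather than a disk-backed
/tmp, and once every local process has attached it can be shm_unlink()ed so the
kernel prunes it when the last process (or the orted, on cleanup) lets go.  A
rough sketch, with a made-up function name rather than a proposed OMPI API:

#include <fcntl.h>      /* O_CREAT, O_RDWR */
#include <sys/mman.h>   /* shm_open, mmap */
#include <sys/stat.h>
#include <unistd.h>     /* ftruncate, close */

void *create_sm_segment(const char *name, size_t len)
{
    /* Link with -lrt on older glibc.  "name" must start with a '/'. */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    /* Once every local rank has attached by name, shm_unlink(name) can be
     * called; the kernel then reclaims the segment as soon as the last
     * process unmaps it, and no dirty pages get flushed to disk on exit. */
    return seg == MAP_FAILED ? NULL : seg;
}
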
>>>>>  >>
>>>>>  >>
>>>>>  >>
>>>>>  >> On Nov 14, 2008, at 8:42 AM, Ralph Castain wrote:
>>>>>  >>
>>>>>>  >>> Hi Eugene
>>>>>>  >>>
>>>>>>  >>> I too am interested - I think we need to do something about the sm
>>>>>>  >>> backing file situation as larger core machines are slated to
>>>>>>  >>> become more prevalent shortly.
>>>>>>  >>>
>>>>>>  >>> I appreciate your info on the sizes and controls. One other
>>>>>>  >>> question: what happens when there isn't enough memory to support
>>>>>>  >>> all this? Are we smart enough to detect this situation? Does the
>>>>>>  >>> sm subsystem quietly shut down? Warn and shut down? Segfault?
>>>>>>  >>>
>>>>>>  >>> I have two examples so far:
>>>>>>  >>>
>>>>>>  >>> 1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single
>>>>>>  >>> node, 2ppn, with btl=openib,sm,self. The program started, but
>>>>>>  >>> segfaulted on the first MPI_Send. No warnings were printed.
>>>>>>  >>>
>>>>>>  >>> 2. again with a ramdisk, /tmp was reportedly set to 16MB
>>>>>>  >>> (unverified - some uncertainty, could have been much larger).
>>>>>>  >>> OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self.
>>>>>>  >>> The program ran to completion without errors or warning. I don't
>>>>>>  >>> know the communication pattern - could be no local comm was
>>>>>>  >>> performed, though that sounds doubtful.
>>>>>>  >>>
>>>>>>  >>> If someone doesn't know, I'll have to dig into the code and figure
>>>>>>  >>> out the response - just hoping that someone can spare me the pain.
>>>>>>  >>>
>>>>>>  >>> Thanks
>>>>>>  >>> Ralph
>>>>>>  >>>
>>>>>>  >>>
>>>>>>  >>> On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
>>>>>>  >>>
>>>>>>>  >>>> Ralph Castain wrote:
>>>>>>>  >>>>
>>>>>>>>  >>>>> As has frequently been commented upon at one time or another,
>>>>>>>>  >>>>> the  shared memory backing file can be quite huge. There used to
>>>>>>>>  >>>>> be a param  for controlling this size, but I can't find it in
>>>>>>>>  >>>>> 1.3 - or at least,  the name or method for controlling file size
>>>>>>>>  >>>>> has morphed into  something I don't recognize.
>>>>>>>>  >>>>>
>>>>>>>>  >>>>> Can someone more familiar with that subsystem point me to one or
>>>>>>>>  >>>>> more  params that will allow us to control the size of that
>>>>>>>>  >>>>> file? It is  swamping our systems and causing OMPI to segfault.
>>>>>>>  >>>>
>>>>>>>  >>>> Sounds like you've already gotten your answers, but I'll add my
>>>>>>>  >>>> $0.02 anyhow.
>>>>>>>  >>>>
>>>>>>>  >>>> The file size is the number of local processes (call it n) times
>>>>>>>  >>>> mpool_sm_per_peer_size (default 32M), but with a minimum of
>>>>>>>  >>>> mpool_sm_min_size (default 128M) and a maximum of
>>>>>>>  >>>> mpool_sm_max_size (default 2G?  256M?).  So, you can tweak those
>>>>>>>  >>>> parameters to control file size.
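
Just to make that sizing rule concrete: size = clamp(n * per_peer, min, max),
with the defaults Eugene quotes (32M per peer, 128M minimum; I assumed 2G for
the maximum since the default is uncertain above).

#include <stdint.h>
#include <stdio.h>

static uint64_t sm_file_size(int nprocs, uint64_t per_peer,
                             uint64_t min_sz, uint64_t max_sz)
{
    uint64_t sz = (uint64_t)nprocs * per_peer;
    if (sz < min_sz) sz = min_sz;
    if (sz > max_sz) sz = max_sz;
    return sz;
}

int main(void)
{
    const uint64_t M = 1024ULL * 1024ULL;
    /* 4 ppn:  4 * 32M = 128M, i.e. pinned at the minimum, and already far
     * larger than a 10MB ramdisk /tmp.  16 ppn: 16 * 32M = 512M. */
    printf("4 ppn  -> %llu MB\n", (unsigned long long)
           (sm_file_size(4, 32 * M, 128 * M, 2048 * M) / M));
    printf("16 ppn -> %llu MB\n", (unsigned long long)
           (sm_file_size(16, 32 * M, 128 * M, 2048 * M) / M));
    return 0;
}
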
>>>>>>>  >>>>
>>>>>>>  >>>> Another issue is possibly how small a backing file you can get
>>>>>>>  >>>> away with.  That is, just forcing the file to be smaller may not
>>>>>>>  >>>> be enough since your job may no longer run.  The backing file
>>>>>>>  >>>> seems to be used mainly by:
>>>>>>>  >>>>
>>>>>>>  >>>> *) eager-fragment free lists:  We start with enough eager
>>>>>>>  >>>> fragments so that we could have two per connection.  So, you
>>>>>>>  >>>> could bump the sm eager size down if you need to shoehorn a job
>>>>>>>  >>>> into a very small backing file.
>>>>>>>  >>>>
>>>>>>>  >>>> *) large-fragment free lists:  We start with 8*n large
>>>>>>>  >>>> fragments.  If this term plagues you, you can bump the sm chunk
>>>>>>>  >>>> size down or reduce the value of 8 (using btl_sm_free_list_num, I
>>>>>>>  >>>> think).
>>>>>>>  >>>>
>>>>>>>  >>>> *) FIFOs:  The code tries to align a number of things on pagesize
>>>>>>>  >>>> boundaries, so you end up with about 3*n*n*pagesize overhead
>>>>>>>  >>>> here.  If this term is causing you problems, you're stuck (unless
>>>>>>>  >>>> you modify OMPI).
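
Plugging a few per-node process counts into that 3*n*n*pagesize FIFO term
(4 KB pages assumed):

#include <stdio.h>

int main(void)
{
    const long page = 4096;               /* assumed page size */
    const int  n[]  = { 4, 16, 64, 128 }; /* local procs per node */

    for (int i = 0; i < 4; i++) {
        long bytes = 3L * n[i] * n[i] * page;
        printf("n = %3d -> FIFO overhead ~ %6ld KB (%.1f MB)\n",
               n[i], bytes / 1024, bytes / (1024.0 * 1024.0));
    }
    return 0;
}

So at 16ppn the FIFOs come to only about 3 MB; it looks like the 128 MB mpool
minimum, not the FIFO term, is what overruns the 10-16 MB ramdisk /tmp in the
cases Ralph describes.
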
>>>>>>>  >>>>
>>>>>>>  >>>> I'm interested in this subject!  :^)
>>>>>>  >>>
>>>>>  >>
>>>>>  >>
>>>>>  >> --
>>>>>  >> Jeff Squyres
>>>>>  >> Cisco Systems
>>>>>  >>
>>>>  >
>>>  
>>>  
>>>  --
>>>  Jeff Squyres
>>>  Cisco Systems
>>>  
>>>  
>>>  
>>  
> 
> 
> 
