On 1/20/09 2:08 PM, "Eugene Loh" <eugene....@sun.com> wrote:
> Richard Graham wrote:
>> Re: [OMPI devel] RFC: sm Latency
>>
>> First, the performance improvements look really nice.
>> A few questions:
>> - How much of an abstraction violation does this introduce?
> Doesn't need to be much of an abstraction violation at all if, by that, we
> mean teaching the BTL about the match header. Just need to make some choices
> and I flagged that one for better visibility.
>
>>> >> I really don't see how teaching the btl about matching will help much
>>> >> (it will save a subroutine call). As I understand the proposal, you aim
>>> >> to selectively pull items out of the fifos; this will break the fifos,
>>> >> as they assume contiguous entries. Logic to manage holes will need to
>>> >> be added.
>
>> This looks like the btl needs to start "knowing" about MPI-level semantics.
> That's one option. There are other options.
>
>>> >> Such as?
>
>> Currently, the btl is purposefully ULP-agnostic.
> What's ULP?
>>> >> Upper Level Protocol
>
>> I ask for 2 reasons:
>> - you mention having the btl look at the match header (if I understood
>> correctly)
>>
> Right, both to know whether there is a match when the user posted a receive
> with MPI_ANY_TAG and to extract the values needed to populate the MPI_Status
> variable. There are other alternatives, like calling back into the PML.
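> Roughly, the check I have in mind looks like the sketch below. This is only
> an illustration: the struct layouts and field names are made up (not the
> real match header or receive request types), and it ignores communicator
> and sequence-number checks entirely.
>
>   #include <mpi.h>
>
>   /* Hypothetical layouts, for illustration only. */
>   typedef struct { int hdr_src; int hdr_tag; } match_hdr_t;
>   typedef struct { int req_peer; int req_tag; } recv_req_t;
>
>   /* Does this fragment's match header satisfy the posted receive? */
>   static int hdr_matches(const match_hdr_t *hdr, const recv_req_t *req)
>   {
>       if (req->req_peer != MPI_ANY_SOURCE && req->req_peer != hdr->hdr_src)
>           return 0;
>       if (req->req_tag != MPI_ANY_TAG && req->req_tag != hdr->hdr_tag)
>           return 0;
>       return 1;
>   }
>
>   /* On a match, populate the user's status from the header. */
>   static void fill_status(MPI_Status *status, const match_hdr_t *hdr)
>   {
>       status->MPI_SOURCE = hdr->hdr_src;
>       status->MPI_TAG    = hdr->hdr_tag;
>   }
>
> Whether that check lives in the BTL itself or behind a PML callback is
> exactly the choice I was flagging.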
>> - not clear to me what you mean by returning the header to the list if
>> the irecv does not complete. If it does not complete, why not just pass the
>> header back for further processing, if all this is happening at the pml
>> level?
>>
> I was trying to read the FIFO to see what's on there. If it's something we
> can handle, we take it and handle it. If it's too complicated, then we just
> balk and tell the upper layer that we're declining any possible action. That
> just seemed to me to be the cleanest approach.
>
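> In code, the "all or nothing" idea is roughly the following. All of the
> names here (sm_fifo_peek, fast_path_can_handle, and so on) are invented for
> the sketch; they are not existing sm BTL calls.
>
>   /* Returns 1 if the fragment was handled on the fast path, 0 if we
>    * declined; on decline the FIFO is left exactly as we found it. */
>   static int try_fast_recv(fifo_t *fifo, recv_req_t *req)
>   {
>       void *frag = sm_fifo_peek(fifo);          /* look, don't consume */
>       if (NULL == frag || !fast_path_can_handle(frag, req)) {
>           return 0;                             /* balk; nothing touched */
>       }
>       (void) sm_fifo_pop(fifo);                 /* commit only on success */
>       fast_path_complete(frag, req);
>       return 1;
>   }
>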
>>> >> See the note above. The fifo logic would have to be changed to manage
>>> >> non-contiguous entries.
>
> Here's an analogy. Let's say you have a house problem. You don't know how
> bad it is. You think you might have to hire an expensive contractor to do
> lots of work, but some local handyman thinks he can do it quickly and cheaply.
> So, you have the handyman look at the job to decide how involved it is. Let's
> say the handyman discovers that it is, indeed, a big job. How would you like
> things left at this point? Two options:
>
> *) Handyman says this is a big job. Bring in the expensive contractor and big
> equipment. I left everything as I found it. Or,
>
> *) Handyman says, "I took apart the such-and-such and I bought a bunch of
> supplies. I ripped out the south wall. The water to the house is turned off.
> Etc."
>
> You (and whoever has to come in to actually do the work) would probably
> prefer that nothing had been started.
>
> I thought it was cleaner to go with the "do the whole job or don't do any
> of it" approach.
>> - The measurements seem to be very dual-process specific. Have you looked
>> at the impact of these changes on other applications at the same process
>> count? "Real" apps would be interesting, but even HPL would be a good
>> start.
>>
> Many measurements are for np=2. There are also np>2 HPCC pingpong results,
> though. (HPCC pingpong measures pingpong between two processes while the
> other np-2 processes sit in wait loops.) HPCC also measures "ring" results;
> these are point-to-point tests in which all np processes take part.
>
> HPL is pretty insensitive to point-to-point performance. It either shows
> basically DGEMM performance or something is broken.
>
> I haven't looked at "real" apps.
>
> Let me be blunt about one thing: much of this is motivated by simplistic
> (HPCC) benchmarks. This is for two reasons:
>
> 1) These benchmarks are highly visible.
> 2) It's hard to tune real apps when you know the primitives need work.
>
> Looking at real apps is important and I'll try to get to that.
>
>>> >> I don't disagree here at all. I just want to make sure that aiming at
>>> >> these important benchmarks does not harm what is really more important:
>>> >> day-to-day use.
>
>> The current sm implementation is aimed only at small SMP node counts; those
>> were really the only relevant type of system when this code was written 5
>> years ago. For large core counts there is a rather simple change that could
>> be put in, and it will give you flat scaling for the sort of tests you are
>> running. If you replace the fifos with a single linked list per process in
>> shared memory, with senders to that process adding match envelopes
>> atomically and each process reading its own linked list (multiple writers
>> and a single reader in the non-threaded case), there will be only one place
>> to poll, regardless of the number of procs involved in the run. One still
>> needs other optimizations to lower the absolute latency, perhaps the ones
>> you have suggested. If one really has all N procs trying to write to the
>> same fifo at once, performance will stink because of contention, but most
>> apps don't have that behaviour.
>>
> Okay. Yes, I am a fan of that approach. But:
>
> *) Doesn't strike me as a "simple" change.
>
>>> >> Instead of a fifo_write (or whatever it is called), an entry is posted
>>> >> to the "head" of a linked list, and the read removes an entry from the
>>> >> list. If one cares about memory locality, you need to return things to
>>> >> the appropriate list, which is implicit in the fifo. More objects need
>>> >> to be allocated in shared memory.
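>>> >> As a rough sketch of what I mean (the names and the use of GCC __sync
>>> >> builtins are just for illustration, not existing code):
>>> >>
>>> >>   typedef struct sm_item {
>>> >>       struct sm_item *next;
>>> >>       /* match envelope payload goes here */
>>> >>   } sm_item_t;
>>> >>
>>> >>   typedef struct { sm_item_t *head; } sm_list_t;  /* one per process */
>>> >>
>>> >>   /* Any sender: atomically push an envelope onto the receiver's list. */
>>> >>   static void sm_list_push(sm_list_t *list, sm_item_t *item)
>>> >>   {
>>> >>       sm_item_t *old;
>>> >>       do {
>>> >>           old = list->head;
>>> >>           item->next = old;
>>> >>       } while (!__sync_bool_compare_and_swap(&list->head, old, item));
>>> >>   }
>>> >>
>>> >>   /* The single reader: detach everything that has arrived in one shot. */
>>> >>   static sm_item_t *sm_list_take_all(sm_list_t *list)
>>> >>   {
>>> >>       sm_item_t *chain;
>>> >>       do {
>>> >>           chain = list->head;
>>> >>       } while (NULL != chain &&
>>> >>                !__sync_bool_compare_and_swap(&list->head, chain, NULL));
>>> >>       return chain;
>>> >>   }
>>> >>
>>> >> Note that pushing at the head hands the reader the envelopes in reverse
>>> >> arrival order, so the reader has to re-order them (or walk to the tail)
>>> >> to preserve per-sender MPI matching order; the entries still have to go
>>> >> back to the right free list afterwards, and in a real shared-memory
>>> >> segment these would likely be offsets rather than raw pointers.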
>
> *) Not sure this addresses all-to-all well. E.g., let's say you post a
> receive for a particular source. Do you then wade through a long FIFO to look
> for your match?
>
>>> >> To pull things off the free list, you do need to look through what is
>>> >> on the queue. If it is not the match you are looking for, just post it
>>> >> to the appropriate local list for later use, just like the matching
>>> >> logic does now. As I mentioned this morning, if you want, you don't
>>> >> have to have a single list per destination; you could have several
>>> >> lists, if you are concerned about too much contention.
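>>> >> Very roughly, the receive path over such a list would look like this
>>> >> (again, every name here is hypothetical, and it builds on the push/take
>>> >> sketch above):
>>> >>
>>> >>   /* Drain the list, keep the envelope that matches this receive, and
>>> >>    * stash the rest on local pending lists for the normal matching
>>> >>    * logic to consume later.  Assumes the chain has been put back into
>>> >>    * arrival order, per the note above. */
>>> >>   static void drain_and_match(sm_list_t *my_list, recv_req_t *req)
>>> >>   {
>>> >>       int matched = 0;
>>> >>       sm_item_t *item = sm_list_take_all(my_list);
>>> >>       while (NULL != item) {
>>> >>           sm_item_t *next = item->next;
>>> >>           if (!matched && envelope_matches(item, req)) {
>>> >>               complete_recv(req, item);
>>> >>               matched = 1;
>>> >>           } else {
>>> >>               append_to_pending(item);   /* per-peer local list */
>>> >>           }
>>> >>           item = next;
>>> >>       }
>>> >>   }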
>
> What the RFC talks about is not the last SM development we'll ever need. It's
> only supposed to be one step forward from where we are today. The "single
> queue per receiver" approach has many advantages, but I think it's a different
> topic.
>
>>> >> This is a big enough proposed change that a call to describe it may be
>>> >> in order. I will state up front that I am against introducing MPI
>>> >> semantics into the btl. I am not against having that sort of option in
>>> >> the code base, but I do want to preserve an option like the pml/btl
>>> >> abstraction.
>
> Rich
>