Re: [OMPI devel] collective problems

Shipman, Galen M. Wed, 7 Nov 2007 23:27:53 -0500

The lengths we go to avoid progress :-)




On 11/7/07 10:19 PM, "Richard Graham" <rlgra...@ornl.gov> wrote:

> The real problem, as you and others have pointed out, is the lack of
> predictable time slices for the progress engine to do its work when relying
> on the ULP to make calls into the library...
> 
> Rich
> 
> 
> On 11/8/07 12:07 AM, "Brian Barrett" <brbar...@open-mpi.org> wrote:
> 
>> As it stands today, the problem is that we can inject things into the
>> BTL successfully that are not injected into the NIC (due to software
>> flow control).  Once a message is injected into the BTL, the PML marks
>> completion on the MPI request.  If it was a blocking send that got
>> marked as complete, but the message isn't injected into the NIC/NIC
>> library, and the user doesn't re-enter the MPI library for a
>> considerable amount of time, then we have a problem.
>> 
>> Personally, I'd rather just not mark MPI completion until a local
>> completion callback from the BTL.  But others don't like that idea, so
>> we came up with a way for back pressure from the BTL to say "it's not
>> on the wire yet".  This is more complicated than just not marking MPI
>> completion early, but why would we do something that helps real apps
>> at the expense of benchmarks?  That would just be silly!
>> 
>> Brian
>> 
>> On Nov 7, 2007, at 7:56 PM, Richard Graham wrote:
>> 
>>> Does this mean that we don't have a queue to store btl level
>>> descriptors that are only partially complete?  Do we do an all or
>>> nothing with respect to btl level requests at this stage?
>>> 
>>> Seems to me like we want to mark things complete at the MPI level
>>> ASAP, and that this proposal is not to do that.  Is this correct?
>>> 
>>> Rich
>>> 
>>> 
>>> On 11/7/07 11:26 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:
>>> 
>>>> On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote:
>>>> 
>>>>>> Remember that this is all in the context of Galen's proposal for
>>>>>> btl_send() to be able to return NOT_ON_WIRE -- meaning that the send
>>>>>> was successful, but it has not yet been sent (e.g., openib BTL
>>>>>> buffered it because it ran out of credits).
>>>>> 
>>>>> Sorry if I missed something obvious, but why does the PML have to be
>>>>> aware of the flow control situation of the BTL?  If the BTL cannot send
>>>>> something right away for any reason, it should be the responsibility
>>>>> of the BTL to buffer it and to progress on it later.
>>>> 
>>>> 
>>>> That's currently the way it is.  But the BTL currently only has the
>>>> option to say two things:
>>>> 
>>>> 1. "ok, done!" -- then the PML will think that the request is complete
>>>> 2. "doh -- error!" -- then the PML thinks that Something Bad Happened(tm)
>>>> 
>>>> What we really need is for the BTL to have a third option:
>>>> 
>>>> 3. "not done yet!"
>>>> 
>>>> So that the PML knows that the request is not yet done, but will allow
>>>> other things to progress while we're waiting for it to complete.
>>>> Without this, the openib BTL currently replies "ok, done!", even when
>>>> it has only buffered a message (rather than actually sending it out).
>>>> This optimization works great (yeah, I know...) except for apps that
>>>> don't dip into the MPI library frequently.  :-\
>>>> 
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> 
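
For reference, the three-way return discussed in the quoted thread can be
sketched in C roughly as follows. This is a toy model, not Open MPI code:
only the name NOT_ON_WIRE comes from the thread, and every type and
function here (send_status, request, toy_btl, toy_btl_send, toy_pml_send,
toy_btl_progress) is hypothetical. It assumes a BTL with a simple credit
counter and a small FIFO of buffered sends, and a PML that marks a request
complete only once the BTL reports the message has actually gone out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes; only NOT_ON_WIRE is named in the thread. */
typedef enum { SEND_DONE, SEND_NOT_ON_WIRE, SEND_ERROR } send_status;

/* Toy stand-in for an MPI request. */
typedef struct { bool complete; } request;

#define MAX_PENDING 8

/* Toy BTL: a credit counter plus a FIFO of buffered sends. */
typedef struct {
    int      credits;               /* software flow-control credits */
    request *pending[MAX_PENDING];  /* accepted but not yet on the wire */
    size_t   npending;
} toy_btl;

/* BTL side: send now if a credit is available, otherwise buffer. */
static send_status toy_btl_send(toy_btl *btl, request *req)
{
    assert(btl != NULL && req != NULL);
    if (btl->credits > 0) {
        btl->credits--;
        return SEND_DONE;            /* 1. "ok, done!" */
    }
    if (btl->npending == MAX_PENDING)
        return SEND_ERROR;           /* 2. "doh -- error!" */
    btl->pending[btl->npending++] = req;
    return SEND_NOT_ON_WIRE;         /* 3. "not done yet!" */
}

/* PML side: mark completion only if the message really went out. */
static send_status toy_pml_send(toy_btl *btl, request *req)
{
    send_status rc = toy_btl_send(btl, req);
    req->complete = (rc == SEND_DONE);
    return rc;
}

/* Progress: credits came back, so drain buffered sends in FIFO order.
 * Completing the request here plays the role of the local completion
 * callback. Returns the number of sends drained. */
static size_t toy_btl_progress(toy_btl *btl, int new_credits)
{
    size_t sent = 0;
    btl->credits += new_credits;
    while (sent < btl->npending && btl->credits > 0) {
        btl->credits--;
        btl->pending[sent]->complete = true;
        sent++;
    }
    /* Shift whatever is still buffered to the front of the queue. */
    for (size_t i = sent; i < btl->npending; i++)
        btl->pending[i - sent] = btl->pending[i];
    btl->npending -= sent;
    return sent;
}
```

In this model a send issued with no credits left comes back as
SEND_NOT_ON_WIRE and its request stays incomplete until a later call to
toy_btl_progress() drains it. That is the situation the thread worries
about: if the application does not re-enter the library, nothing calls the
progress function, and the buffered message never leaves.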
