I have the branch complete for executing this - please see

https://bitbucket.org/rhc/ompi-scale

Timeout set to Feb 4th after that week's telecon


On Jan 17, 2014, at 9:57 AM, Ralph Castain <r...@open-mpi.org> wrote:

> After discussion on the telecon, we decided to:
> 
> 1. let the modex be non-blocking so we can fall through - but only when the 
> corresponding MCA param is set!
> 
> 2. don't modify modex_recv to add the callback, as the MPI layer really 
> doesn't know how to handle this in an async fashion. Modifying that behavior 
> would be difficult and could wind up impacting the critical path - something 
> we may decide to look into more at a later time.
> 
> So we will block in a call to modex_recv until the info for that particular 
> process can be obtained. I'll add a timeout feature (via yet another MCA 
> param) so we can gracefully recover if the remote proc never answers for some 
> reason.
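The blocking-with-timeout behavior described above can be sketched roughly as follows. This is an illustration only: `wait_for_modex` and its parameters are invented for the example, not actual OMPI code, and a real implementation would drive the event loop inside the wait rather than busy-spinning.

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative sketch: spin until the peer's modex data arrives, giving
 * up after timeout_sec so we can recover gracefully if the remote proc
 * never answers. Names are assumptions for the example, not OMPI API. */
static bool wait_for_modex(volatile bool *active, double timeout_sec)
{
    time_t start = time(NULL);
    while (*active) {
        /* a real implementation would call opal_progress() here */
        if (difftime(time(NULL), start) > timeout_sec) {
            return false;  /* timed out: peer never answered */
        }
    }
    return true;  /* data arrived: *active was cleared by the callback */
}
```

In real code the `active` flag would be cleared by the completion callback when the data arrives, and the timeout value would come from the proposed MCA param.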
> 
> Will provide an update when this is ready
> 
> 
> On Jan 13, 2014, at 3:00 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> What I want to do is make the current "modex" become a no-op, which gives us 
>> a lazy modex. As I noted in my commit message, startup scales horribly if we 
>> don't, hence the MCA param requirement so people don't enable this unless 
>> their BTL/MTLs will support it.
>> 
>> What I found when playing with that arrangement is that a BTL/MTL is going 
>> to need or want data at first message, but that data may not be available 
>> yet because that particular remote proc hasn't registered all of its modex 
>> data yet. A beautiful race condition. So I was forced to block everyone at 
>> "modex" just to ensure the remote data was available at time of request.
>> 
>> If I remove the global "barrier" requirement, then I don't want to "block" 
>> on modex_recv, since that is done on a per-proc basis. Even though one proc 
>> isn't ready to return the data, another might be - so I'd let you queue 
>> up as many modex_recv calls as you like, resolving each of them as data 
>> becomes available. This leaves the MPI layer free to send a message as soon 
>> as the target remote proc is ready, without waiting for some other proc to 
>> register its modex info.
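The per-proc queueing idea above can be sketched like this: each modex_recv for a peer whose data hasn't arrived yet is parked on a pending list, and each entry is resolved individually when that peer's data shows up, with no global barrier. All names here are illustrative, not the actual OMPI structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

/* One outstanding modex_recv request (illustrative, not actual OMPI code). */
typedef struct {
    int  peer;  /* which remote proc this request targets */
    bool done;  /* set once that peer's modex data arrives */
} pending_recv_t;

static pending_recv_t pending[MAX_PENDING];
static size_t npending = 0;

/* Park a request for a peer whose data isn't available yet. */
static void queue_recv(int peer)
{
    if (npending < MAX_PENDING) {
        pending[npending++] = (pending_recv_t){ .peer = peer, .done = false };
    }
}

/* When one peer's data arrives, resolve only the requests waiting on it;
 * requests for other peers stay queued. Returns how many were resolved. */
static size_t data_arrived(int peer)
{
    size_t resolved = 0;
    for (size_t i = 0; i < npending; i++) {
        if (pending[i].peer == peer && !pending[i].done) {
            pending[i].done = true;
            resolved++;
        }
    }
    return resolved;
}
```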
>> 
>> Make sense?
>> 
>> 
>> 
>> On Mon, Jan 13, 2014 at 12:05 PM, Barrett, Brian W <bwba...@sandia.gov> 
>> wrote:
>> Is there any place that this can actually be used?  It's a fairly large
>> change to the RTE interface (which we should try to keep stable), and I
>> can't convince myself that it's useful; in general, if a BTL or MTL is
>> asking for a piece of data, the MPI library is stuck until that data's
>> available.  I can see doing some lazy approach, but I can't see making the
>> modex_recv call non-blocking.
>> 
>> Brian
>> 
>> On 1/11/14 9:28 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>> 
>> >NOTE:  This will involve a change to the MPI-RTE interface
>> >
>> >WHAT:  Modify modex_recv to add a callback function that will return the
>> >requested data when it is available
>> >
>> >WHY:    Enable faster startup on large scale systems by eliminating the
>> >current mandatory modex barrier during MPI_Init
>> >
>> >HOW:    The ompi_modex_recv functions will have a callback function and a
>> >(void*)cbdata argument added to them. An ompi_modex_recv_t struct will be
>> >defined that includes a pointer to the returned data plus a "bool active"
>> >flag that can be used to detect when the data has been returned if
>> >blocking is required.
>> >
>> >        When a modex_recv is issued, ORTE will check for the presence of
>> >the requested data and immediately issue a callback if the data is
>> >available. If the data is not available, then ORTE will request the data
>> >from the remote process, and execute the callback when the remote process
>> >returns it.
>> >
>> >        The current behavior of a blocking modex barrier will remain the
>> >default - the new behavior will only take effect if specifically requested
>> >by the user via MCA param. With this new behavior, the current call to
>> >"modex" in MPI_Init will become a "no-op" when the processes are launched
>> >via mpirun - this will be executed in ORTE so that other RTEs that do not
>> >wish to support async modex behavior are not impacted.
>> >
>> >WHEN:   No hurry on this as it is intended for 1.9, so let's say mid-Feb.
>> >Info on a branch will be made available in the near future.
>> >
>> >
>> >_______________________________________________
>> >devel mailing list
>> >de...@open-mpi.org
>> >http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >
>> 
>> 
>> --
>>   Brian W. Barrett
>>   Scalable System Software Group
>>   Sandia National Laboratories
>> 
>> 
>> 
>> 
> 
