On 12/5/07 8:56 AM, "Tim Prins" <tpr...@cs.indiana.edu> wrote:
>> The latter issue exists even for MPI jobs. Consider the case of a single
>> process job that comm_spawns a child job onto other nodes. The RTE will
>> launch daemons on the new nodes, and then broadcast the "launch procs"
>> command across all the daemons (this is done to exploit a scalable comm
>> procedure). Thus, the daemon on the initial node will see the launch
>> command, but will know it is not participating and hence take no action.
>
> So we're doing something that is inherently non-scalable to take
> advantage of scalable routines? It seems that in most cases we only want
> to send the info to the daemons that need it, even if this means
> unicasting the message.
Um...we are taking advantage of known scalable communication algorithms to
broadcast a message that, in the typical case, is required by all recipients.
It is only in the comm_spawn case that some participants in the broadcast may
take no action based on the message's contents.
The issue is more with collectives going in the other direction - i.e.,
collectively gathering info from across all the daemons and reporting it to
the HNP. This is required for synchronization - e.g., ensuring that we don't
return from a spawn function call until the procs are launched, so the user
knows it is okay to proceed.
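To see what that synchronization point looks like from the user's side, here is
a minimal example (the child executable name "./child" is just a placeholder
for this sketch):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm children;

    MPI_Init(&argc, &argv);

    /* Ask the RTE to start two copies of a child executable. */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    /* Per the synchronization requirement above, we should not get here
       until the RTE reports the children as launched - the user's
       "okay to proceed" point. */
    printf("children launched - okay to proceed\n");

    /* (the children would call the matching disconnect) */
    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}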
The problem is that, in the case of comm_spawn, some of the daemons -might- not
participate in the launch, yet we would still like them to participate in the
collective that returns info to the HNP as part of a scalable comm algorithm.
Can it be done? I think so - it may just take a little tricky programming. It
just needs to be proven.
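To make the idea concrete, here is a toy sketch - plain C, not ORTE code, with
the routing tree and proc counts invented for illustration - of a gather that
flows up a daemon routing tree even though one daemon has nothing to report:

#include <stdio.h>

#define NUM_DAEMONS 7

/* parent[i] = index of daemon i's parent in the routing tree;
   daemon 0 plays the role of the HNP. */
static const int parent[NUM_DAEMONS] = { -1, 0, 0, 1, 1, 2, 2 };

/* number of local procs each daemon launched for this comm_spawn;
   daemon 1 launched nothing but still participates. */
static const int local_procs[NUM_DAEMONS] = { 0, 0, 2, 1, 1, 2, 0 };

int main(void)
{
    int contrib[NUM_DAEMONS];
    int i;

    /* Each daemon starts with its own contribution (possibly empty). */
    for (i = 0; i < NUM_DAEMONS; i++) {
        contrib[i] = local_procs[i];
    }

    /* Leaves-to-root reduction along the routing tree: since parent[i] < i,
       walking from the highest index downward folds every child's
       (possibly empty) contribution into its parent before the parent
       itself reports upward. */
    for (i = NUM_DAEMONS - 1; i > 0; i--) {
        contrib[parent[i]] += contrib[i];
        printf("daemon %d reports %s to daemon %d\n", i,
               local_procs[i] ? "its procs' contact info" : "an empty contribution",
               parent[i]);
    }

    printf("HNP sees %d procs reported - launch complete\n", contrib[0]);
    return 0;
}

The point is simply that an empty contribution still moves the collective along
the tree, so a daemon that launched nothing does not stall the gather.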
Hope that helps
Ralph
>
> I guess I don't quite understand the problem.
>
> Tim
>
> Ralph H Castain wrote:
>> III. Collective communications across daemons
>> A few months ago, we deliberately extended the abstraction between the RTE
>> and MPI layers to reduce their interaction. This has generally been
>> perceived as a good thing, but it did have a cost: namely, it increased the
>> communications required during launch. In prior OMPI versions, we took
>> advantage of tighter integration to aggregate RTE and MPI communications
>> required during startup - this was lost in the abstraction effort.
>>
>> We have since been working to reduce the resulting "abstraction penalty". We
>> have managed to achieve communication performance that scales linearly with
>> the number of nodes. Further improvements, though, depend upon our ability
>> to do quasi-collective communications in the RTE.
>>
>> Collectives in the RTE are complicated by the current requirement to support
>> non-MPI applications (the topic of another email), and by the fact that not
>> every node participates in a given operation. The former issue is reflected
>> in the fact that the RTE (and hence the daemon) cannot know whether the
>> application process is going to call Init or not - hence, the logic in the
>> daemon must not block on any communication during launch, since the proc may
>> completely execute and terminate without ever calling Init. Thus, entering a
>> collective to, for example, collect RML contact info is problematic, as that
>> info may never become available - and so the HNP -cannot- enter a
>> collective call to wait for its arrival.
>>
>> The latter issue exists even for MPI jobs. Consider the case of a single
>> process job that comm_spawns a child job onto other nodes. The RTE will
>> launch daemons on the new nodes, and then broadcast the "launch procs"
>> command across all the daemons (this is done to exploit a scalable comm
>> procedure). Thus, the daemon on the initial node will see the launch
>> command, but will know it is not participating and hence take no action.
>>
>> If we now attempt to perform a collective communication (say, to collect RML
>> contact info), we face four interacting obstacles:
>>
>> (a) the initial daemon isn't launching anything this time, and so won't know
>> it has to participate. This can obviously be resolved since it will
>> certainly know that a launch is being conducted, so we could have it simply
>> go ahead and call the appropriate collective at that time;
>>
>> (b) the launch of the local procs is conducted asynchronously - hence, no
>> daemon can know when another daemon has completed the launch and thus is
>> ready to communicate;
>>
>> (c) the failure of any local launch can generate an error response back to
>> the daemons with orders to kill their procs, exit, or take other actions. The
>> daemons must, therefore, not be sitting in blocking communication calls, as
>> that would prevent them from responding as directed; and
>>
>> (d) the daemons may not be fully connected - hence, any collective must
>> "follow" the communication topology.
>>
>> What we could use is a quasi-collective "gather" based on non-blocking
>> receives that preserves the daemons' ability to respond to unexpected
>> commands such as "kill/exit". If someone is interested in working on this,
>> please contact me for a fuller description of the problem.
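
As a rough illustration of the kind of non-blocking, interruptible gather being
asked for here, the following toy sketch (plain C, not ORTE code - the message
tags, inbox, and daemon state are all invented) shows a daemon dispatching
messages one at a time, so a pending gather never prevents it from acting on a
kill/exit order:

#include <stdio.h>

enum msg_tag { MSG_CONTACT_INFO, MSG_KILL_EXIT };

struct msg {
    enum msg_tag tag;
    int          sender;
};

struct daemon_state {
    int gather_expected;   /* contributions this daemon must relay upward */
    int gather_received;
    int exiting;
};

/* Called once per arriving message; nothing here ever blocks. */
static void dispatch(struct daemon_state *d, const struct msg *m)
{
    switch (m->tag) {
    case MSG_CONTACT_INFO:
        d->gather_received++;
        printf("got contact info from %d (%d of %d)\n",
               m->sender, d->gather_received, d->gather_expected);
        if (d->gather_received == d->gather_expected) {
            printf("gather complete - relay combined info to parent daemon\n");
        }
        break;
    case MSG_KILL_EXIT:
        /* Handled immediately even though the gather is still pending. */
        printf("kill/exit ordered by %d - abandoning gather\n", m->sender);
        d->exiting = 1;
        break;
    }
}

int main(void)
{
    struct daemon_state d = { 3, 0, 0 };

    /* Simulated arrival order: the kill order shows up mid-gather. */
    struct msg inbox[] = {
        { MSG_CONTACT_INFO, 4 },
        { MSG_KILL_EXIT,    0 },
        { MSG_CONTACT_INFO, 5 },
    };
    int n = sizeof(inbox) / sizeof(inbox[0]);
    int i;

    for (i = 0; i < n && !d.exiting; i++) {
        dispatch(&d, &inbox[i]);
    }
    return 0;
}

Completion of the gather is detected by counting contributions inside the
message handler rather than by sitting in a blocking receive, which is the
property the obstacles (b) and (c) above require.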
>>
>> Ralph