The latter issue exists even for MPI jobs. Consider the case of a
single-process job that comm_spawns a child job onto other nodes. The RTE
will launch daemons on the new nodes, and then broadcast the "launch procs"
command across all the daemons (this is done to exploit a scalable comm
procedure). Thus, the daemon on the initial node will see the launch
command, but will know it is not participating and hence take no action.

So we're doing something that is inherently non-scalable in order to take
advantage of scalable routines? It seems that in most cases we want to send
the info only to the daemons that need it, even if that means unicasting
the message.

I guess I don't quite understand the problem.

Tim

Ralph H Castain wrote:
III. Collective communications across daemons
A few months ago, we deliberately extended the abstraction between the RTE
and MPI layers to reduce their interaction. This has generally been
perceived as a good thing, but it did have a cost: namely, it increased the
communications required during launch. In prior OMPI versions, we took
advantage of tighter integration to aggregate RTE and MPI communications
required during startup - this was lost in the abstraction effort.

We have since been working to reduce the resulting "abstraction penalty". We
have managed to achieve communication performance that scales linearly with
the number of nodes. Further improvements, though, depend upon our ability
to do quasi-collective communications in the RTE.

Collectives in the RTE are complicated by the current requirement to support
non-MPI applications (the topic of another email), and by the fact that not
every node participates in a given operation. The former issue is reflected
in the fact that the RTE (and hence, the daemon) cannot know whether the
application process is going to call Init - hence, the logic in the daemon
must not block on any communication during launch, since the proc may
completely execute and terminate without ever calling Init. Thus, entering a
collective to, for example, collect RML contact info is problematic as that
info may never become available - and so, the HNP -cannot- enter a
collective call to wait for its arrival.
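
To make the former point concrete, here is a hypothetical (and deliberately
trivial) example of a perfectly legal job from the RTE's perspective: the
proc runs to completion without ever touching MPI, so no RML contact info
is ever produced and the daemon cannot afford to block waiting for it.

/* Hypothetical non-MPI "application": it executes and terminates without
 * ever calling MPI_Init, so its daemon never receives contact info. */
#include <stdio.h>

int main(void)
{
    printf("hello from a non-MPI proc\n");
    return 0;
}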

The latter issue exists even for MPI jobs. Consider the case of a
single-process job that comm_spawns a child job onto other nodes. The RTE
will launch daemons on the new nodes, and then broadcast the "launch procs"
command across all the daemons (this is done to exploit a scalable comm
procedure). Thus, the daemon on the initial node will see the launch
command, but will know it is not participating and hence take no action.
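
For concreteness, a minimal sketch of the application side of that scenario
is below. The "add-host" Info key is an Open MPI-specific extension (not
standard MPI) and the hostnames are assumptions for illustration; the point
is simply that a one-process job can force daemon launches on nodes it does
not itself occupy.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Open MPI-specific key (assumed here): add new nodes to the job */
    MPI_Info_set(info, "add-host", "node02,node03");

    /* spawn 2 children; the RTE must start daemons on the new nodes and
     * then issue the "launch procs" command across all of the daemons */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

    MPI_Comm_disconnect(&child);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}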

If we now attempt to perform a collective communication (say, to collect RML
contact info), we face four interacting obstacles:

(a) the initial daemon isn't launching anything this time, and so won't know
it has to participate. This can obviously be resolved since it will
certainly know that a launch is being conducted, so we could have it simply
go ahead and call the appropriate collective at that time;

(b) the launch of the local procs is conducted asynchronously - hence, no
daemon can know when another daemon has completed the launch and thus is
ready to communicate;

(c) the failure of any local launch can generate an error response back to
the daemons with orders to kill their procs, exit, or other things. The
daemons must, therefore, not be in blocking communication calls as this will
prevent them from responding as directed; and

(d) the daemons may not be fully connected - hence, any collective must
"follow" the communication topology.

What we could use is a quasi-collective "gather" based on non-blocking
receives that preserves the daemons' ability to respond to unexpected
commands such as "kill/exit". If someone is interested in working on this,
please contact me for a fuller description of the problem.
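
As a rough illustration of the pattern being asked for, here is a sketch of
such a quasi-gather. It is written with MPI non-blocking calls purely for
readability - the actual daemons would use the RML, not MPI - and the tags,
buffer size, and handle_command() helper are all hypothetical.

/* Sketch only: post non-blocking receives for the gather contributions
 * while keeping a receive posted on the command channel, so a "kill/exit"
 * order can still be acted on at any time. */
#include <mpi.h>
#include <stdlib.h>

#define TAG_CONTACT_INFO  1     /* hypothetical tag for contributions */
#define TAG_COMMAND       2     /* hypothetical tag for control commands */
#define MAX_MSG           4096

static void handle_command(const char *cmd)
{
    /* hypothetical: kill local procs, exit, etc.; a real implementation
     * would also be able to abandon the gather here */
    (void)cmd;
}

void quasi_gather(MPI_Comm daemon_comm, int nchildren)
{
    char (*bufs)[MAX_MSG] = malloc(nchildren * sizeof(*bufs));
    char cmd_buf[MAX_MSG];
    MPI_Request *reqs = malloc(nchildren * sizeof(MPI_Request));
    MPI_Request cmd_req;
    int completed = 0;

    /* one non-blocking receive per child in the routing tree */
    for (int i = 0; i < nchildren; i++)
        MPI_Irecv(bufs[i], MAX_MSG, MPI_CHAR, MPI_ANY_SOURCE,
                  TAG_CONTACT_INFO, daemon_comm, &reqs[i]);

    /* also stay ready for unexpected commands */
    MPI_Irecv(cmd_buf, MAX_MSG, MPI_CHAR, MPI_ANY_SOURCE,
              TAG_COMMAND, daemon_comm, &cmd_req);

    while (completed < nchildren) {
        int idx, flag;
        MPI_Status status;

        /* progress the gather without blocking */
        MPI_Testany(nchildren, reqs, &idx, &flag, &status);
        if (flag && idx != MPI_UNDEFINED)
            completed++;

        /* remain responsive to kill/exit orders */
        MPI_Test(&cmd_req, &flag, &status);
        if (flag) {
            handle_command(cmd_buf);
            MPI_Irecv(cmd_buf, MAX_MSG, MPI_CHAR, MPI_ANY_SOURCE,
                      TAG_COMMAND, daemon_comm, &cmd_req);
        }
    }

    /* ...relay the aggregated contributions up the routing tree... */
    free(bufs);
    free(reqs);
}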

Ralph



