Jeff,
I agree with your viewpoint, mainly about the "reachability". But...
From the FT (fault tolerance) viewpoint, some FT architectures need to
recover an application process on a node different from the original one. In
this case a new modex has to be performed. That is fine for coordinated C/R;
for uncoordinated C/R, on the other hand, it is not a good choice, I think.
Once again, the tradeoffs...
A possible solution is to perform n-1 modex exchanges, each involving the
recovered process and one of the other processes... Is that better than an
allgather modex? I don't know. I think not. And what is the impact of an
allgather modex while the MPI thread is delivering messages? The answers to
these questions could suggest that uncoordinated C/R is not possible on
Open MPI.
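
To put rough numbers on the two options, here is a small message-count sketch
in Python. It is a toy model, not Open MPI code. It assumes the allgather runs
as a simple ring (n-1 steps, every process sends one block per step) and that
each pairwise exchange costs two messages, one in each direction. The function
names are made up for illustration.

```python
def ring_allgather_messages(n):
    """Total point-to-point messages for a ring allgather among n processes.

    Assumption: n - 1 steps, and in each step every process sends one
    block to its neighbour, so n messages per step.
    """
    if n < 1:
        raise ValueError("need at least one process")
    return n * (n - 1)


def pairwise_recovery_messages(n):
    """Messages for n - 1 pairwise exchanges with the recovered process.

    Assumption: each exchange is one message in each direction.
    """
    if n < 1:
        raise ValueError("need at least one process")
    return 2 * (n - 1)


# Compare the two counts for a few job sizes.
for n in (4, 64, 1024):
    print(n, ring_allgather_messages(n), pairwise_recovery_messages(n))
```

The count alone does not answer the question: it ignores latency, the real
allgather algorithm in use, and whether the surviving processes have to stop
delivering messages to take part.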
Leonardo Fialho
Jeff Squyres wrote:
On Nov 7, 2008, at 10:18 AM, Leonardo Fialho wrote:
I understand that a process needs the contact information to
send MPI messages to other processes, and the modex provides it. My
question is, why not perform the contact exchange only when it is
necessary?
For example: in an M/W application, the workers do not need more
information than the master's contact info.
I think that it reduces the startup time, but increases the cost of
the *first* communication between two peers.
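
The idea in the question above can be sketched as a lazily filled contact
table. This is only an illustration in Python, not how Open MPI works: the
`LazyContactTable` class, the `directory` dict, and the address strings are
all hypothetical stand-ins for whatever service would hold the contact info.

```python
class LazyContactTable:
    """Toy model of on-demand contact exchange (not Open MPI code)."""

    def __init__(self, directory):
        # `directory` maps rank -> address string; it stands in for a
        # remote lookup service (assumption for this sketch).
        self._directory = directory
        self._cache = {}
        self.lookups = 0  # number of (expensive) remote lookups so far

    def contact(self, rank):
        """Return the contact info for `rank`, fetching it on first use."""
        if rank not in self._cache:
            self.lookups += 1  # only the first contact pays this cost
            self._cache[rank] = self._directory[rank]
        return self._cache[rank]


directory = {0: "master:5000", 1: "worker1:5001", 2: "worker2:5002"}
worker = LazyContactTable(directory)
worker.contact(0)      # first message to the master: one lookup
worker.contact(0)      # later messages reuse the cached entry
print(worker.lookups)  # prints 1
```

A worker that only ever talks to the master never fetches the other entries,
which is the startup saving; the price is the extra lookup on first contact.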
FWIW, this is actually a pretty complex topic. There are many, many
tradeoffs in terms of what performance do you want vs. what
functionality do you want. This subject has been discussed for many,
many hours by the OMPI developers. :-)
The modex is performed during MPI_INIT; the v1.3 series' modex is
quite a bit more efficient than the v1.2 series' modex. The modex
information comprises several things, some of which are either the
contact info or "reachability" info of BTL modules. For the openib
BTL, for example, port subnet IDs and MTUs are passed in the modex,
but LIDs don't need to be passed (in some cases) until two processes
actually try to reach each other. We use the reachability information
to determine whether a given BTL module *could* be used to connect to
a remote peer. For example, if we get to the end of MPI_INIT and find
a peer that cannot be reached, we abort (after hours of debate, we
decided it was better to abort right away when there was a peer that
could not be reached rather than abort only on attempted first contact
because it could be a simple network/configuration error that should
be detected immediately, rather than erroring out [potentially] long
into a multi-hour run).
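
The abort-early policy described above can be sketched roughly as follows.
This is a simplified Python illustration, not the actual Open MPI logic: it
assumes reachability reduces to "the two processes share at least one
transport name", and the function names and transport strings are
hypothetical.

```python
def unreachable_peers(my_transports, peer_transports):
    """Return the sorted ranks of peers that share no transport with us.

    `peer_transports` maps rank -> iterable of transport names, as a
    stand-in for the reachability info collected at startup.
    """
    mine = set(my_transports)
    return sorted(
        rank for rank, theirs in peer_transports.items()
        if mine.isdisjoint(theirs)
    )


def check_reachability(my_transports, peer_transports):
    """Fail at startup if any peer cannot be reached at all."""
    missing = unreachable_peers(my_transports, peer_transports)
    if missing:
        # Abort right away instead of failing on first contact,
        # possibly hours into the run.
        raise RuntimeError(f"cannot reach peers {missing}; aborting at startup")


peers = {1: ["tcp", "openib"], 2: ["tcp"], 3: ["sm"]}
print(unreachable_peers(["tcp"], peers))  # prints [3]
```

Calling `check_reachability(["tcp"], peers)` with this data would raise
immediately, which mirrors the choice to surface a configuration error at the
end of startup rather than at first contact.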
We have been discussing a "modex-less" startup for quite a while; this
is actually one of the topics on the agenda for an engineering meeting
that we're having in December. modex-less is quite important for
scalability to many thousands of processes, but other tradeoffs may be
necessary to make this work (read: we've talked about modex-less for
forever; we're finally likely to do it in the near future because of
some upcoming very very large scale machines at US DOE labs).
Does that make sense?
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478