Ralph,
Very good document.
Regarding the MPI layer (in case of a fault), my idea is to give the BML
the ability to handle the BTL errors that occur when a process dies (and
has probably been migrated) by discovering its new location. I think
this is possible because the HNP requests the restart for the orted
daemon, so it knows the new location of the faulty process.
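To make the idea concrete, here is a minimal sketch as a toy Python model, not Open MPI code. The names `Hnp`, `Bml`, `BtlError`, `restart`, `lookup` and `send` are hypothetical and only illustrate the flow: a send fails on the stale location, the BML asks the HNP for the new one, and the send is retried.

```python
class BtlError(Exception):
    """Raised by the toy transport when the peer's last known node is dead."""


class Hnp:
    """Toy stand-in for the HNP: records where each process currently lives."""

    def __init__(self):
        self.locations = {}

    def restart(self, rank, new_node):
        # The HNP requested the restart, so it learns the new location.
        self.locations[rank] = new_node

    def lookup(self, rank):
        return self.locations[rank]


class Bml:
    """Toy stand-in for the BML: caches peer locations, refreshes on error."""

    def __init__(self, hnp, alive_nodes):
        self.hnp = hnp
        self.alive_nodes = alive_nodes    # nodes the toy transport can reach
        self.cache = dict(hnp.locations)  # snapshot taken at startup

    def _btl_send(self, node, payload):
        if node not in self.alive_nodes:
            raise BtlError(node)
        return (node, payload)

    def send(self, rank, payload):
        try:
            return self._btl_send(self.cache[rank], payload)
        except BtlError:
            # Peer died (and was probably migrated): ask the HNP where it is.
            self.cache[rank] = self.hnp.lookup(rank)
            return self._btl_send(self.cache[rank], payload)


hnp = Hnp()
hnp.restart(1, "node-a")          # initial placement of rank 1
alive = {"node-a", "node-b"}
bml = Bml(hnp, alive)

alive.discard("node-a")           # node-a fails
hnp.restart(1, "node-b")          # the HNP restarts rank 1 elsewhere
delivered = bml.send(1, "hello")  # first attempt fails, the retry succeeds
```

The sketch ignores messages that were in flight when the process died; it only shows where the location lookup would sit.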
Leonardo
Ralph Castain wrote:
If you look at the Dec meeting wiki, you will see that we are moving
quickly to a modex-less launch anyway. It won't be the default because
it requires pre-discovery of the cluster's network resources (for
which we will provide a tool or method), but it will help resolve some
of these problems.
Outside of that, I will have to leave it to the FT folks to figure out
how to resolve modex situations. We have the ability to support
multiple modex models (and already do), but I don't know if you can do
what you describe or not - I'm not sure how the MPI layer will handle
that situation.
Ralph
On Nov 13, 2008, at 6:22 AM, Leonardo Fialho wrote:
Jeff,
I agree with your viewpoint, principally about the "reachability".
But...
From the FT viewpoint, some FT architectures sometimes want to
recover an application process on a node different from the original
one. In this case a new modex would have to be performed. That is fine
for coordinated C/R, but for uncoordinated C/R I think it is not a
good choice. Once again, the tradeoffs...
A possible solution is to perform n-1 modex exchanges, each involving
the recovered process and one of the other processes... Is that
better than an allgather modex? I don't know. I think not. And what is
the impact of an allgather modex while the MPI thread is delivering
messages? The answers to these questions could suggest that
uncoordinated C/R is not possible in Open MPI.
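The two options can be sketched as a toy Python model. It does not answer which one is better; it only shows who has to take part. The function names and the dictionary-based contact tables are assumptions for illustration, not the Open MPI modex.

```python
def pairwise_update(tables, recovered, new_info):
    """The recovered rank exchanges contact info with each of the other ranks."""
    exchanges = 0
    for rank, table in tables.items():
        if rank == recovered:
            continue
        table[recovered] = new_info            # the peer learns the new location
        tables[recovered][rank] = table[rank]  # the recovered rank learns the peer
        exchanges += 1
    tables[recovered][recovered] = new_info
    return exchanges


def allgather_update(tables, infos):
    """Every rank contributes its info and receives everyone else's."""
    for table in tables.values():
        table.clear()
        table.update(infos)
    return len(tables)  # number of ranks that must take part collectively


n = 4
infos = {r: f"node-{r}" for r in range(n)}
tables = {r: dict(infos) for r in range(n)}

tables[2] = {}  # rank 2 was recovered on another node and starts empty
exchanges = pairwise_update(tables, 2, "node-9")

infos[2] = "node-9"
participants = allgather_update(tables, infos)
```

With n = 4 the pairwise variant needs 3 exchanges that all involve the recovered rank, while the allgather variant needs all 4 ranks to enter the collective at the same time, which is the part that fits poorly with uncoordinated C/R.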
Leonardo Fialho
Jeff Squyres wrote:
On Nov 7, 2008, at 10:18 AM, Leonardo Fialho wrote:
I understand that a process needs the contact information to
send MPI messages to other processes, and the modex provides it. My
question is: why not perform the contact exchange only when it is
necessary?
For example: in an M/W application, the workers do not need more
information than the master's contact info.
I think that it reduces the startup time, but increases the cost of
the *first* communication between two peers.
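A minimal sketch of such an on-demand exchange, as a toy Python model. The `LazyContacts` class and the `directory` lookup table are hypothetical; they stand in for whatever out-of-band service would hold the contact info.

```python
class LazyContacts:
    """Fetch a peer's contact info only on first communication with it."""

    def __init__(self, directory):
        self.directory = directory  # hypothetical out-of-band lookup service
        self.cache = {}
        self.lookups = 0

    def contact(self, rank):
        if rank not in self.cache:
            # The first communication with this peer pays the lookup cost.
            self.lookups += 1
            self.cache[rank] = self.directory[rank]
        return self.cache[rank]


directory = {0: "master@node-0", 1: "worker@node-1", 2: "worker@node-2"}

worker = LazyContacts(directory)
worker.contact(0)  # first message to the master: one lookup
worker.contact(0)  # later messages reuse the cached entry
```

Here the worker ends up knowing only the master's contact info, and nothing is exchanged at startup.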
FWIW, this is actually a pretty complex topic. There are many, many
tradeoffs in terms of what performance do you want vs. what
functionality do you want. This subject has been discussed for
many, many hours by the OMPI developers. :-)
The modex is performed during MPI_INIT; the v1.3 series' modex is
quite a bit more efficient than the v1.2 series' modex. The modex
information comprises several things, some of which are either
the contact info or "reachability" info of BTL modules. For the
openib BTL, for example, port subnet ID's and MTU's are passed in
the modex, but LIDs don't need to be passed (in some cases) until
two processes actually try to reach each other. We use the
reachability information to determine whether a given BTL module
*could* be used to connect to a remote peer. For example, if we get
to the end of MPI_INIT and find a peer that cannot be reached, we
abort (after hours of debate, we decided it was better to abort
right away when there was a peer that could not be reached rather
than abort only on attempted first contact because it could be a
simple network/configuration error that should be detected
immediately, rather than erroring out [potentially] long into a
multi-hour run).
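The fail-fast check can be sketched as a toy Python model. It assumes, purely for illustration, that two processes are reachable when they share at least one subnet ID; the real per-BTL reachability logic is more involved, and the function names are hypothetical.

```python
def unreachable_peers(my_subnets, peer_subnets):
    """Return the ranks that share no subnet ID with this process."""
    return sorted(
        rank
        for rank, subnets in peer_subnets.items()
        if not (my_subnets & subnets)
    )


def check_reachability(my_subnets, peer_subnets):
    """Abort at init time instead of on first contact, hours into a run."""
    missing = unreachable_peers(my_subnets, peer_subnets)
    if missing:
        raise RuntimeError(f"unreachable peers at init: {missing}")


# Subnet IDs each peer published in the (toy) modex.
modex = {
    1: {"subnet-A"},
    2: {"subnet-B"},
    3: {"subnet-A", "subnet-C"},
}

bad = unreachable_peers({"subnet-A"}, modex)
```

Calling `check_reachability({"subnet-A"}, modex)` would raise immediately because rank 2 shares no subnet with the caller, which mirrors the decision to abort at the end of MPI_INIT.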
We have been discussing a "modex-less" startup for quite a while;
this is actually one of the topics on the agenda for an engineering
meeting that we're having in December. Modex-less is quite important
for scalability to many thousands of processes, but other tradeoffs
may be necessary to make this work (read: we've talked about
modex-less for forever; we're finally likely to do it in the near
future because of some upcoming very very large scale machines at US
DOE labs).
Does that make sense?
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel