Re: [OMPI devel] Modex and others

Jeff Squyres Fri, 14 Nov 2008 11:18:27 -0500

Hmm. I'm not sure the BML is the right place to do this. The BML doesn't know anything about the internals of the BTLs; it's just a dispatch / multiplexer.

Unfortunately, few of us are in a good place to respond at the moment-- SC is next week and we're all hosed trying to get ready for that...



On Nov 13, 2008, at 1:07 PM, Leonardo Fialho wrote:

Ralph,

Very good document.
About the MPI layer (in case of a fault), my idea is to give the BML the ability to handle the BTL errors that occur when a process dies (and has probably been migrated) by discovering its new location. I think this is possible because the HNP requests the restart for the orted daemon, so it knows the new location of the faulty process.
Leonardo

Ralph Castain escribió:
If you look at the Dec meeting wiki, you will see that we are moving quickly to a modex-less launch anyway. It won't be the default because it requires pre-discovery of the cluster's network resources (for which we will provide a tool or method), but it will help resolve some of these problems.
Outside of that, I will have to leave it to the FT folks to figure out how to resolve modex situations. We have the ability to support multiple modex models (and already do), but I don't know if you can do what you describe or not - I'm not sure how the MPI layer will handle that situation.
Ralph

On Nov 13, 2008, at 6:22 AM, Leonardo Fialho wrote:
Jeff,
I agree with your viewpoint, principally about the "reachability". But...
Looking from the FT viewpoint, sometimes (or in some FT architectures) we want to recover an application process on a node different from the original one. In this case a new modex should be performed. That's fine for coordinated C/R; for uncoordinated C/R, on the other hand, it's not a good choice, I think. Once again, the tradeoffs...
A possible solution is to perform n-1 modex exchanges, each involving the recovered process and one of the other processes... Is that better than an allgather modex? I don't know. I think not. And what is the impact of an allgather modex while the MPI thread is delivering messages? The answers to these questions could suggest that uncoordinated C/R is not possible on Open MPI.
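To put rough numbers on the "n-1 modex vs. allgather modex" question above, here is a minimal sketch that only counts messages. It is a toy model, not Open MPI code: `pairwise_update` and `ring_allgather` are hypothetical names, contact info is a plain string, and the allgather is assumed to use a simple ring algorithm (a real implementation may pick a different one). It says nothing about the second question, the impact on messages already in flight.

```python
def pairwise_update(contacts, recovered, new_contact):
    """Recovered rank exchanges contact info with each of the n-1 other ranks.

    contacts[r][p] is the contact of rank p as known by rank r.
    Returns the number of point-to-point messages used.
    """
    n = len(contacts)
    messages = 0
    contacts[recovered][recovered] = new_contact
    for peer in range(n):
        if peer == recovered:
            continue
        contacts[peer][recovered] = new_contact            # recovered -> peer
        contacts[recovered][peer] = contacts[peer][peer]   # peer -> recovered
        messages += 2
    return messages


def ring_allgather(own):
    """Ring allgather: n-1 steps, every rank forwards one entry per step.

    own[r] is the contact info of rank r.
    Returns (per-rank tables, number of messages).
    """
    n = len(own)
    known = [{r: own[r]} for r in range(n)]
    messages = 0
    for step in range(n - 1):
        # Build all sends first so the step behaves like a simultaneous exchange.
        sends = []
        for r in range(n):
            src = (r - step) % n  # entry that rank r received in the previous step
            sends.append(((r + 1) % n, src, known[r][src]))
        for dst, src, contact in sends:
            known[dst][src] = contact
            messages += 1
    return known, messages
```

In this model the pairwise variant costs 2(n-1) messages and the ring allgather n(n-1), so the pairwise exchange is cheaper in message count; whether it is better overall depends on the tradeoffs discussed in the thread.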
Leonardo Fialho


Jeff Squyres escribió:
On Nov 7, 2008, at 10:18 AM, Leonardo Fialho wrote:
I understand that a process needs to have the contact information to send MPI messages to other processes, and the modex permits it. My question is: why not perform the contact exchange only when it is necessary?
For example: in an M/W application, the workers do not need more information than the master's contact info.
I think that it reduces the startup time, but increases the cost of the *first* communication between two peers.
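The tradeoff in the question above (exchange everything at startup vs. look up a peer on first contact) can be sketched with a toy model. Everything here is hypothetical and for illustration only: `Directory` stands in for wherever contact info would be published, the contact strings are made up, and none of this is how Open MPI implements the modex.

```python
class Directory:
    """Hypothetical stand-in for wherever contact info is published."""

    def __init__(self):
        self.entries = {}
        self.lookups = 0

    def publish(self, rank, contact):
        self.entries[rank] = contact

    def lookup(self, rank):
        self.lookups += 1
        return self.entries[rank]


class Process:
    def __init__(self, rank, directory):
        self.rank = rank
        self.directory = directory
        self.peers = {}  # cache: rank -> contact info

    def eager_exchange(self, nprocs):
        # Startup-time exchange: fetch every peer's contact info up front.
        for peer in range(nprocs):
            if peer != self.rank:
                self.peers[peer] = self.directory.lookup(peer)

    def send(self, peer, payload):
        # Lazy path: fetch the contact only on first communication, then cache it.
        if peer not in self.peers:
            self.peers[peer] = self.directory.lookup(peer)
        return (self.peers[peer], payload)


def count_lookups(nprocs, eager):
    """Run a master/worker pattern (rank 0 is the master) and count lookups."""
    directory = Directory()
    procs = [Process(r, directory) for r in range(nprocs)]
    for p in procs:
        directory.publish(p.rank, f"host{p.rank}:5000")
    if eager:
        for p in procs:
            p.eager_exchange(nprocs)
    for worker in procs[1:]:
        worker.send(0, "result")
        procs[0].send(worker.rank, "task")
    return directory.lookups
```

With 8 processes, the eager variant performs 56 lookups before any message is sent, while the lazy variant performs 14, each paid on the first message between a pair.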
FWIW, this is actually a pretty complex topic. There are many, many tradeoffs in terms of what performance do you want vs. what functionality do you want. This subject has been discussed for many, many hours by the OMPI developers. :-)
The modex is performed during MPI_INIT; the v1.3 series' modex is quite a bit more efficient than the v1.2 series' modex. The modex information comprises several things, some of which are either the contact info or "reachability" info of BTL modules. For the openib BTL, for example, port subnet IDs and MTUs are passed in the modex, but LIDs don't need to be passed (in some cases) until two processes actually try to reach each other. We use the reachability information to determine whether a given BTL module *could* be used to connect to a remote peer. For example, if we get to the end of MPI_INIT and find a peer that cannot be reached, we abort (after hours of debate, we decided it was better to abort right away when there was a peer that could not be reached rather than abort only on attempted first contact because it could be a simple network/configuration error that should be detected immediately, rather than erroring out [potentially] long into a multi-hour run).
We have been discussing a "modex-less" startup for quite a while; this is actually one of the topics on the agenda for an engineering meeting that we're having in December. modex-less is quite important for scalability to many thousands of processes, but other tradeoffs may be necessary to make this work (read: we've talked about modex-less for forever; we're finally likely to do it in the near future because of some upcoming very very large scale machines at US DOE labs).
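The "abort at the end of MPI_INIT if any peer is unreachable" policy described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual BTL logic: the data layout, the function names, and the rule "a transport is usable if both sides share a subnet ID" are all hypothetical.

```python
class UnreachablePeerError(RuntimeError):
    """Raised at the end of init when some peer has no usable transport."""


def usable_transports(local, remote):
    """Return the transports whose published info says the peer *could* be reached.

    local / remote map a transport name to {"subnets": set_of_subnet_ids}.
    Assumed rule: a transport is usable if both sides share a subnet ID.
    """
    usable = []
    for name, info in local.items():
        peer_info = remote.get(name)
        if peer_info is None:
            continue  # the peer did not publish this transport at all
        if info["subnets"] & peer_info["subnets"]:
            usable.append(name)
    return sorted(usable)


def check_reachability(my_rank, modex):
    """Build rank -> usable transports, failing fast on the first unreachable peer.

    modex maps each rank to the info it published during the exchange.
    """
    table = {}
    for rank, remote in modex.items():
        if rank == my_rank:
            continue
        transports = usable_transports(modex[my_rank], remote)
        if not transports:
            # Fail now rather than hours into the run on first contact.
            raise UnreachablePeerError(
                f"rank {my_rank} cannot reach rank {rank}"
            )
        table[rank] = transports
    return table
```

The point of the design is that a misconfigured network shows up immediately at startup, at the cost of requiring reachability info for every peer before the application runs.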
Does that make sense?
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems
