Re: [OMPI devel] RFC: revamp topo framework

Jeff Squyres Fri, 30 Oct 2009 15:28:20 -0400

What George is describing is the Right answer, but it may take you a little time.

FWIW: the complexity of a topo component is actually pretty low. It's essentially a bunch of glue code (that I can probably mostly provide) and your mapping algorithms about how to reorder the communicator ranks.

To be clear: topo components are *ONLY* about re-ordering ranks in a communicator -- the back-end of MPI_CART_CREATE and friends.

The BTL components that George is talking about are Byte Transfer Layer components; essentially the brains behind MPI_SEND and friends. Open MPI has a per-device list of BTLs that can service each peer MPI process. Hence, if you're sending to another MPI process on the same host, the first BTL in the list will be the shared memory BTL. If you're sending to an MPI process on a different server that you're connected to via ethernet, the TCP BTL may be at the top of the list. And so on.


It sounds like you actually want to make *two* components:

- topo: for reordering ranks during MPI_CART_CREATE and friends
- btl: use the underlying network primitives for sending when possible

As George indicated, the BTL module in each MPI process can determine during startup which MPI process peers it can talk to. It can then tell the upper-layer routing algorithm "I can talk to peer processes X, Y, and Z -- I cannot talk to peer processes A, B, and C". The upper-layer router (the PML module) will then put your BTL at the top of the list for peer processes X, Y, and Z, and will not put your BTL on the list for peer processes A, B, and C. For A, B, and C, other BTLs will be used (e.g., TCP).
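
To make the per-peer list idea concrete, here is a toy sketch in plain C. It is NOT Open MPI's real BTL or PML interface: toy_btl, torus_can_reach, tcp_can_reach, and build_peer_list are all hypothetical names invented for illustration, and the "torus" is assumed to be a 1D ring so the example stays short. Each toy BTL only answers "can I reach this peer?", and the list for a peer is built in preference order from the BTLs that say yes.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BTLS 4

/* Hypothetical stand-in for a BTL: a name plus a reachability callback. */
typedef struct {
    const char *name;
    bool (*can_reach)(int self, int peer, int nprocs);
} toy_btl;

/* Toy "torus" BTL: reaches only the two nearest neighbors on a ring
 * (a 1D torus is assumed here purely to keep the example small). */
bool torus_can_reach(int self, int peer, int nprocs)
{
    return peer == (self + 1) % nprocs ||
           peer == (self + nprocs - 1) % nprocs;
}

/* Toy "tcp" BTL: reaches every process other than ourselves. */
bool tcp_can_reach(int self, int peer, int nprocs)
{
    (void)nprocs; /* unused */
    return peer != self;
}

/* Build the ordered BTL list for one peer. btls[] is assumed to be sorted
 * by preference (fastest first); a BTL that cannot reach the peer is simply
 * left off that peer's list. Returns the number of entries written to out[]. */
int build_peer_list(const toy_btl btls[], int nbtls,
                    int self, int peer, int nprocs,
                    const toy_btl *out[])
{
    assert(nbtls >= 0 && nbtls <= MAX_BTLS);
    int n = 0;
    for (int i = 0; i < nbtls; ++i) {
        if (btls[i].can_reach(self, peer, nprocs)) {
            out[n++] = &btls[i];
        }
    }
    return n;
}
```

With btls[] = { torus, tcp } and 8 processes, rank 0 would get the list (torus, tcp) for peers 1 and 7, and just (tcp) for peer 4. That is the same shape as the X/Y/Z versus A/B/C split described above.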


Does that make sense?

To answer your question from a prior mail: the unity topo component is used for the remapping of ranks in MPI_CART_CREATE. Look in ompi/mca/topo/unity/.
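
For orientation only: the MPI standard specifies row-major numbering for Cartesian topologies, so the relation between a rank in the Cartesian communicator and its coordinates can be sketched as below. This is a standalone illustration, not code copied from the unity component; rank_to_coords and coords_to_rank are hypothetical helper names. A topo component's reordering decides which process receives which new rank, and the coordinates then follow from that rank.

```c
#include <assert.h>

/* Row-major rank -> coordinates: the last dimension varies fastest.
 * dims[] holds the extent of each dimension; coords[] receives the result. */
void rank_to_coords(int rank, int ndims, const int dims[], int coords[])
{
    assert(rank >= 0 && ndims > 0);
    for (int d = ndims - 1; d >= 0; --d) {
        assert(dims[d] > 0);
        coords[d] = rank % dims[d];
        rank /= dims[d];
    }
}

/* Inverse mapping: coordinates -> row-major rank. */
int coords_to_rank(int ndims, const int dims[], const int coords[])
{
    int rank = 0;
    for (int d = 0; d < ndims; ++d) {
        assert(coords[d] >= 0 && coords[d] < dims[d]);
        rank = rank * dims[d] + coords[d];
    }
    return rank;
}
```

For example, with dims = {2, 3, 4}, rank 17 maps to coordinates (1, 1, 1), since 1*12 + 1*4 + 1 = 17.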




On Oct 30, 2009, at 11:53 AM, George Bosilca wrote:

Luigi,

The way Open MPI currently selects the network to be used between
processes matches the first approach you proposed very well. As we
support multiple networks simultaneously, a BTL (the low-level network
driver) can service only a subset of peers. All other communications
will automatically be redirected through another BTL (which has to be
available). In the past there were some attempts to route messages, but
this code is not in the trunk.

   george.

On Oct 30, 2009, at 04:47 , Luigi Scorzato wrote:

>
>
> I am very interested in this, but let me explain in more detail my
> present situation and goals.
>
> I am working in a group that is testing a system under development
> which is connected with both:
> - an ordinary all-to-all standard interface (where open-mpi is
> already available) but with limited performance and scalability.
> - a custom 3D torus network, with no mpi available and custom low-level
> communication primitives (under development), from which we expect
> higher performance and scalability.
>
>
> I have two approaches in mind:
>
> 1st approach.
> Use the standard network interface to setup MPI. However, through a
> precompilation step, redefine a few MPI_ functions (MPI_Send()
> MPI_Recv() and others) such that they call the torus primitives, if
> the communication is between nearest neighbors, and fall back into
> standard MPI through the standard interface if not. This can only
> work if I can choose the mpi-ranks of my system in a way that
> MPI_Cart_create() will generate coordinates consistent with the
> physical topology.
> ***There must be a place - somewhere in the open-mpi code - where
> the cartesian coordinates are chosen, presumably as a deterministic
> function of the mpi-ranks and the dimensions (as given by
> MPI_Dims_create). I expected it to be in MPI_Cart_create(). But I
> could not find it. Can anyone help?***
> This approach has obvious limitations of portability, besides
> requiring the availability of a fallback network, but it gives me
> full control of what I need to do, which is essential since my
> primary goal is to get a few important codes working in the new
> system asap.
>
>
> 2nd approach.
> Develop a new "torus" topo component, as explained by Jeff. This is
> certainly the *right* solution, but there are two problems:
> - because of my poor familiarity with the open-mpi source code, I am
> not able to estimate how long it will take me.
> - in a first phase, the torus primitives will not support all-to-all
> communications but only nearest-neighbor ones. Hence, full
> portability is excluded anyway and/or a fallback network is still
> needed. In other words, the topo component should be able to deal
> with two networks, and I have no idea how much this will
> complicate things.
>
>
> I necessarily have to push the 1st approach, for the moment, but I
> am very much interested in studying the 2nd and if I see that it is
> realistic (given the limitations above) and safe, I may turn to it
> completely.
>
> thanks for your feedback and best regards, Luigi
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
Jeff Squyres
jsquy...@cisco.com
