On Nov 3, 2009, at 3:40 AM, Luigi Scorzato wrote:
This defines the precise relation between ranks and coordinates. Once I know this, I do not even need to write a topo component, because I can assign the ranks of my computing nodes in a rankfile so that each node gets the coordinates it physically needs.
Fair enough. A topo component would make it unnecessary to lay out your processes in a specific order because it could (hypothetically) understand your physical topology and re-order the ranks accordingly.
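To make the rank/coordinate relation concrete, here is a minimal sketch in plain C. It assumes the row-major numbering that MPI cartesian topologies use (last dimension varies fastest). The function names are made up for illustration and are not part of Open MPI.

```c
/* Hypothetical helpers: row-major mapping between a rank and its
 * coordinates on an ndims-dimensional grid (last dimension fastest). */

/* Fill coords[] with the coordinates of the given rank. */
void rank_to_coords(int rank, int ndims, const int dims[], int coords[])
{
    for (int i = ndims - 1; i >= 0; --i) {
        coords[i] = rank % dims[i];
        rank /= dims[i];
    }
}

/* Inverse mapping. Each coordinate is wrapped periodically first,
 * as on a torus, so -1 means "last position in that dimension". */
int coords_to_rank(int ndims, const int dims[], const int coords[])
{
    int rank = 0;
    for (int i = 0; i < ndims; ++i) {
        int c = ((coords[i] % dims[i]) + dims[i]) % dims[i];
        rank = rank * dims[i] + c;
    }
    return rank;
}
```

With a mapping like this, a rankfile only has to place rank r on the node that physically sits at the coordinates rank_to_coords() returns for r.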
A different issue is the BTL component. This is actually where my approaches 1 and 2 differ (my previous distinction was confusing, because I did not understand the difference between topo and BTL components). In the 1st approach I would redefine some MPI functions that are crucial for my code so that they call the low-level torus primitives when the communication occurs between nearest neighbors, and fall back to the Open MPI functions otherwise. The 2nd approach would be to develop our own torus BTL. The fact that one can choose a "priority list of networks" is definitely great and dispels my worries about the feasibility of the 2nd approach in my case. The only remaining question is whether I can get familiar with the BTL stuff fast enough. What do you suggest I read in order to learn quickly how to create a BTL component?
The BTL is a bit more complicated than topo -- topo is actually pretty straightforward. A BTL is a dumb byte-pusher that is controlled by an upper-level framework: the Point-to-point Messaging Layer (PML). The PML effects the semantics of the MPI point-to-point communications; PML components are the back-ends to MPI_SEND and friends. The PML initializes BTLs during MPI_INIT and builds up the priority lists of networks, etc. Then during MPI_SEND (etc.), the PML uses this information to decide what to do with messages -- fragment them over multiple BTLs, etc. It then calls the BTL modules in question to actually do the send. On receive, the BTLs make upcalls to the PML saying "here's a fragment; you handle it".
Hence, in this way, the BTLs are dumb byte pushers -- they simply send to and receive from individual peers (without any MPI semantics at all) and give all the fragments they receive to the PML, which then effects all the MPI semantics.
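The division of labor described above can be sketched as a toy model in plain C. Everything here (the toy_* names, the struct layout, the loopback "network") is invented for illustration. The real interfaces in btl.h and pml.h are considerably richer, and a real PML also handles matching and ordering, which this sketch sidesteps by delivering fragments in order.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical upcall: a BTL hands a received fragment up to the PML. */
typedef void (*toy_recv_cb_t)(void *pml_ctx, const char *frag, size_t len);

/* Hypothetical BTL module: pushes bytes, knows nothing about MPI. */
typedef struct toy_btl {
    const char   *name;
    size_t        max_frag;  /* largest fragment this network accepts */
    toy_recv_cb_t recv_cb;   /* registered by the PML at init time */
    void         *pml_ctx;
} toy_btl_t;

/* Toy PML state: a reassembly buffer for a single message. */
typedef struct toy_pml {
    char   buf[256];
    size_t received;
} toy_pml_t;

/* "Send" on the toy network: loop the fragment straight back
 * to the receive upcall, as if the peer had just received it. */
void toy_btl_send(toy_btl_t *btl, const char *frag, size_t len)
{
    btl->recv_cb(btl->pml_ctx, frag, len);
}

/* PML receive upcall: append the fragment to the reassembly buffer,
 * dropping anything that does not fit. */
void toy_pml_recv(void *ctx, const char *frag, size_t len)
{
    toy_pml_t *pml = ctx;
    size_t room = sizeof pml->buf - pml->received;
    if (len > room)
        len = room;
    memcpy(pml->buf + pml->received, frag, len);
    pml->received += len;
}

/* PML send: stripe one message round-robin over the available BTLs.
 * Returns 0 on success, -1 if no usable BTL is available. */
int toy_pml_send(toy_btl_t *btls, size_t nbtls, const char *msg, size_t len)
{
    size_t off = 0, i = 0;
    if (nbtls == 0)
        return -1;
    while (off < len) {
        toy_btl_t *btl = &btls[i++ % nbtls];
        size_t n = len - off;
        if (btl->max_frag == 0)
            return -1;
        if (n > btl->max_frag)
            n = btl->max_frag;
        toy_btl_send(btl, msg + off, n);
        off += n;
    }
    return 0;
}
```

The point of the sketch is only the direction of the calls: the PML decides how to split the message and calls down into each BTL, and each BTL calls back up with whatever bytes arrive.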
Read ompi/mca/btl/btl.h and ompi/mca/pml/pml.h for the details of the interfaces.
Are the network primitives of your network like TCP (reads and writes can partially complete), or are they like Myrinet / IB (messages are read and written discretely, potentially also starting reads and writes and later receiving completion calls indicating that they finished)?
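To illustrate why the distinction matters, here is a minimal sketch of the TCP-like case using the POSIX write() call on a blocking descriptor. The helper name write_all is made up. With stream semantics the sender has to loop because any single call may accept only part of the buffer. With message semantics (the Myrinet / IB style) there is no such loop: a message either goes out whole or not at all, and completion is typically reported later.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: push all len bytes through a stream-style
 * descriptor, retrying after partial writes and interrupted calls.
 * Returns 0 on success, -1 on error (errno is set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted by a signal: retry */
            return -1;
        }
        p   += n;              /* skip past the bytes that were accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

A BTL built on such primitives would most likely use non-blocking descriptors and remember how far each fragment has progressed instead of spinning in a loop, but the bookkeeping is the same.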
-- Jeff Squyres jsquy...@cisco.com