Luigi, I tried a configuration similar to your Approach 1 on a small test cluster, with interesting (promising) results. While the topology itself is deterministic, I found the actual performance to be under-determined in practice: it depends on the symmetry and partitioning of the tasks and the data.
Your second approach is understandable as a generalized (sub-optimal) solution, but I ended up abandoning hard-coded/hard-wired topologies in favor of a more dynamic approach, in order to improve the efficiency and effectiveness of our compute fabric. This approach depends on a context of several factors: concurrent schedules, priorities, existing configurations, specific task and data partitions, etc. I'm afraid I cannot be more specific at this time. There are myriad resulting topologies: some patterns already identified, some almost indescribable (at this time). The determining factor is usually the structure of the existing codes and the particular data reduction/partitioning of the job. I found the hardware and topologies closely coupled with the existing software and data, which provides the constraint.

Ken

> -----Original Message-----
> From: devel-boun...@open-mpi.org
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Luigi Scorzato
> Sent: Friday, October 30, 2009 2:47 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] RFC: revamp topo framework
>
> I am very interested in this, but let me explain my present situation
> and goals in more detail.
>
> I am working in a group that is testing a system under development
> which is connected with both:
> - an ordinary all-to-all standard interface (where Open MPI is already
>   available) but with limited performance and scalability;
> - a custom 3D torus network, with no MPI available and custom
>   low-level communication primitives (under development), from which
>   we expect higher performance and scalability.
>
> I have two approaches in mind.
>
> 1st approach.
> Use the standard network interface to set up MPI. However, through a
> precompilation step, redefine a few MPI_ functions (MPI_Send(),
> MPI_Recv() and others) such that they call the torus primitives if the
> communication is between nearest neighbors, and fall back to standard
> MPI through the standard interface if not. This can only work if I can
> choose the MPI ranks of my system in a way that MPI_Cart_create() will
> generate coordinates consistent with the physical topology.
> ***There must be a place - somewhere in the Open MPI code - where the
> cartesian coordinates are chosen, presumably as a deterministic
> function of the MPI ranks and the dimensions (as given by
> MPI_Dims_create()). I expected it to be in MPI_Cart_create(), but I
> could not find it. Can anyone help?***
> This approach has obvious portability limitations, besides requiring
> the availability of a fallback network, but it gives me full control
> of what I need to do, which is essential since my primary goal is to
> get a few important codes working on the new system asap.
>
> 2nd approach.
> Develop a new "torus" topo component, as explained by Jeff. This is
> certainly the *right* solution, but there are two problems:
> - because of my poor familiarity with the Open MPI source code, I am
>   not able to estimate how long it will take me;
> - in a first phase, the torus primitives will not support all-to-all
>   communication but only nearest-neighbor communication. Hence, full
>   portability is excluded anyway and/or a fallback network is still
>   needed. In other words, the topo component should be able to deal
>   with two networks, and I have no idea how much this will complicate
>   things.
>
> I necessarily have to push the 1st approach for the moment, but I am
> very much interested in studying the 2nd, and if I see that it is
> realistic (given the limitations above) and safe, I may turn to it
> completely.
>
> Thanks for your feedback and best regards,
> Luigi
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
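
Regarding the asterisked question in the 1st approach: the MPI standard specifies row-major numbering for the processes in a cartesian structure, so with reorder = 0 the coordinates returned by MPI_Cart_coords() are a fixed, deterministic function of the rank and the dimensions (last dimension varying fastest); as far as I can tell, Open MPI's base topo module follows that numbering. A small self-contained check, using only standard MPI calls (nothing here is taken from the thread):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs, coords[3];
      int dims[3]    = {0, 0, 0};   /* zeros: let MPI_Dims_create fill these in */
      int periods[3] = {1, 1, 1};   /* periodic in all dimensions: a torus */
      MPI_Comm cart;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* Factor nprocs into a balanced 3D grid. */
      MPI_Dims_create(nprocs, 3, dims);

      /* reorder = 0: ranks are kept as in MPI_COMM_WORLD; the coordinates
       * follow the row-major numbering of the cartesian structure. */
      MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);
      MPI_Cart_coords(cart, rank, 3, coords);

      printf("rank %d -> (%d,%d,%d) in a %dx%dx%d torus\n",
             rank, coords[0], coords[1], coords[2],
             dims[0], dims[1], dims[2]);

      MPI_Comm_free(&cart);
      MPI_Finalize();
      return 0;
  }

If the physical placement of the processes matches that row-major order (e.g. via a suitable hostfile or rankfile ordering), the logical and physical tori coincide, which is what the 1st approach needs.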
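
And a minimal sketch of the interception idea itself, under stated assumptions: torus_send() is a hypothetical placeholder for the custom low-level primitive (not something from this thread), cart_comm is assumed to have been created at startup as in the example above, and the fallback path uses the PMPI profiling interface, which is one standard way to replace MPI_Send() without a precompilation step. Tag matching and non-contiguous datatypes on the torus path are ignored here.

  #include <mpi.h>
  #include <stddef.h>

  /* Hypothetical low-level primitive of the custom torus network;
   * assumed to return 0 on success (placeholder convention). */
  extern int torus_send(void *buf, size_t len, int dest_rank);

  /* The 3D cartesian communicator, assumed to be created once at startup. */
  static MPI_Comm cart_comm;

  /* Return 1 if 'dest' is one of the six nearest neighbors of 'me'
   * on the periodic 3D grid described by cart_comm. */
  static int is_nearest_neighbor(int me, int dest)
  {
      int dim, lo, hi;
      for (dim = 0; dim < 3; dim++) {
          MPI_Cart_shift(cart_comm, dim, 1, &lo, &hi);
          if (dest == lo || dest == hi)
              return 1;
      }
      return 0;
  }

  int MPI_Send(void *buf, int count, MPI_Datatype dtype,
               int dest, int tag, MPI_Comm comm)
  {
      int me, size;

      MPI_Comm_rank(comm, &me);
      if (comm == cart_comm && is_nearest_neighbor(me, dest)) {
          /* Nearest neighbor on the torus: use the custom primitive. */
          MPI_Type_size(dtype, &size);
          return torus_send(buf, (size_t)count * size, dest) == 0
                     ? MPI_SUCCESS : MPI_ERR_OTHER;
      }
      /* Not a nearest neighbor: fall back to the standard network. */
      return PMPI_Send(buf, count, dtype, dest, tag, comm);
  }

The same pattern would apply to MPI_Recv() and the other redefined calls; whether this is done via the PMPI layer as above or via the precompilation step described in the 1st approach is an implementation choice.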