Luigi,

I tried a configuration similar to your first approach on a small test
cluster, with interesting (promising) results. While the topology is
deterministic, I found that the actual performance is under-determined in
practice, depending on the symmetry and partitioning of the tasks and the
data.

Your second approach makes sense as a generalized (sub-optimal) solution,
but I ended up abandoning hard-coded/hard-wired topologies in favor of a
more dynamic approach, in order to improve the efficiency and effectiveness
of our compute fabric. That approach depends on several contextual factors -
concurrent schedules, priorities, existing configurations, the specific task
and data partitions, and so on. I'm afraid I cannot be more specific at this
time.

The resulting topologies are myriad - some patterns already identified, some
(as yet) hard to describe. The determining factor is usually the structure
of the existing codes and the particular data reduction/partitioning of the
job. I found the hardware and topologies closely coupled to the existing
software and data, and that coupling is what provides the constraints.

Ken


> -----Original Message-----
> From: devel-boun...@open-mpi.org 
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Luigi Scorzato
> Sent: Friday, October 30, 2009 2:47 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] RFC: revamp topo framework
> 
> 
> 
> I am very interested in this, but let me explain my present 
> situation and goals in more detail.
> 
> I am working in a group that is testing a system under 
> development which is connected to both:
> - an ordinary all-to-all standard interface (where open-mpi 
> is already available) but with limited performance and 
> scalability.
> - a custom 3D torus network, with no MPI available and 
> custom low-level communication primitives (under 
> development), from which we expect higher performance and 
> scalability.
> 
> 
> I have two approaches in mind:
> 
> 1st approach.
> Use the standard network interface to set up MPI. However, 
> through a precompilation step, redefine a few MPI_ functions 
> (MPI_Send(), MPI_Recv() and others) such that they call the 
> torus primitives if the communication is between nearest 
> neighbors, and fall back to standard MPI over the standard 
> interface if not. This can only work if I can choose the MPI 
> ranks of my system in such a way that MPI_Cart_create() will 
> generate coordinates consistent with the physical topology.
> ***There must be a place - somewhere in the open-mpi code - 
> where the Cartesian coordinates are chosen, presumably as a 
> deterministic function of the MPI ranks and the dimensions 
> (as given by MPI_Dims_create). I expected it to be in 
> MPI_Cart_create(), but I could not find it. Can anyone 
> help?*** This approach has obvious portability limitations, 
> besides requiring the availability of a fallback network, 
> but it gives me full control of what I need to do, which is 
> essential since my primary goal is to get a few important 
> codes working on the new system asap.
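
On the *** question: as far as I recall, with reorder = 0 the new
communicator keeps the original ranks and the coordinates follow the usual
row-major convention of MPI_Cart_coords(); in Open MPI that mapping is
handled by the topo framework rather than inside the MPI_Cart_create()
binding itself, which may be why it was hard to find. For what it's worth,
here is a rough sketch of the interception you describe, done through the
PMPI profiling layer instead of a precompilation step. torus_send() is a
placeholder for your low-level primitive, cart_comm is assumed to be the
communicator obtained from MPI_Cart_create() with reorder = 0, and the
neighbor test is only illustrative:

#include <mpi.h>
#include <stdlib.h>

/* placeholder for the custom low-level torus primitive */
extern int torus_send(void *buf, int count, MPI_Datatype dt, int dest);

/* the Cartesian communicator created with reorder = 0 */
static MPI_Comm cart_comm = MPI_COMM_NULL;

/* returns 1 if 'dest' is a nearest neighbor of this rank on the 3D torus */
static int is_torus_neighbor(int dest)
{
    int dims[3], periods[3], my_c[3], dst_c[3], d, diff, hops = 0;
    PMPI_Cart_get(cart_comm, 3, dims, periods, my_c);   /* my own coords  */
    PMPI_Cart_coords(cart_comm, dest, 3, dst_c);        /* coords of dest */
    for (d = 0; d < 3; d++) {
        diff = abs(my_c[d] - dst_c[d]);
        if (periods[d] && diff == dims[d] - 1)
            diff = 1;                                   /* wrap-around link */
        hops += diff;
    }
    return hops == 1;
}

/* intercepted MPI_Send: torus fast path, PMPI fallback
   (add const to buf when building against MPI-3 or later) */
int MPI_Send(void *buf, int count, MPI_Datatype dt,
             int dest, int tag, MPI_Comm comm)
{
    if (comm == cart_comm && is_torus_neighbor(dest))
        return torus_send(buf, count, dt, dest);
    return PMPI_Send(buf, count, dt, dest, tag, comm);
}

The same test would cover MPI_Recv(); anything the torus cannot reach simply
drops through to the PMPI_ layer and the ordinary interface.
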
> 
> 
> 2nd approach.
> Develop a new "torus" topo component, as explained by Jeff. 
> This is certainly the *right* solution, but there are two 
> problems:
> - because of my poor familiarity with the open-mpi source 
> code, I am not able to estimate how long it will take me.
> - in a first phase, the torus primitives will not support 
> all-to-all communications but only nearest-neighbor ones. 
> Hence, full portability is excluded anyway and/or a fallback 
> network is still needed. In other words, the topo component 
> should be able to deal with two networks, and I have no idea 
> how much this will complicate things.
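
Just to make the nearest-neighbor limitation concrete, the split looks the
same under either approach. In the plain standard-MPI sketch below (buffer
names and sizes purely illustrative), the halo exchange is the traffic the
torus primitives could carry, while the reduction at the end still has to
cross the fallback interface:

#include <mpi.h>

/* halo exchange along each dimension of a 3D Cartesian communicator,
   followed by a global reduction */
void exchange_and_reduce(MPI_Comm cart, double *halo_out, double *halo_in,
                         int n, double local, double *global)
{
    int d, lo, hi;
    for (d = 0; d < 3; d++) {
        /* ranks one step away along dimension d */
        MPI_Cart_shift(cart, d, 1, &lo, &hi);
        /* nearest-neighbor traffic: what the 3D torus can serve */
        MPI_Sendrecv(halo_out, n, MPI_DOUBLE, hi, 0,
                     halo_in,  n, MPI_DOUBLE, lo, 0,
                     cart, MPI_STATUS_IGNORE);
    }
    /* involves every rank: still needs the fallback network */
    MPI_Allreduce(&local, global, 1, MPI_DOUBLE, MPI_SUM, cart);
}

Either way, something has to decide per message which network a transfer
uses; the only question is whether that logic lives in wrappers or in a
topo-aware component.
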
> 
> 
> For the moment I necessarily have to push the 1st approach, 
> but I am very much interested in studying the 2nd, and if I 
> see that it is realistic (given the limitations above) and 
> safe, I may switch to it completely.
> 
> thanks for your feedback and best regards, Luigi
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
