On Mon, 15 Nov 2010, Ralph Castain wrote:

Guess I am a little confused. Every MPI process already has full knowledge
of what node all other processes are located on - this has been true for
quite a long time.
Ok, I didn't see that.

Once my work is complete, mpirun will have full knowledge of each node's
hardware resources. Terry will then use that in mpirun's mappers. The
resulting launch message will contain a full mapping of procs to cores -
i.e., every daemon will know the core placement of every process in the job.
That info will be passed down to each MPI proc. Thus, upon launch, every MPI
process will know not only the node for each process, but also the hardware
resources of that node, and the bindings of every process in the job to that
hardware.
All right.

Some things bug me, however:
 1. What if the placement has been done by a wrapper script or by the resource manager? I.e., how do you know where MPI procs are located?
 2. How scalable is it? I would think there is an allgather with 1 process per node; am I right?
 3. How is that information represented? As a graph?

So the only thing missing is the switch topology of the cluster (the
inter-node topology). We modified carto a while back to support input of
switch topology information, though I'm not sure how many people ever used
that capability - not much value in it so far. We just set it up so that
people could describe the topology, and then let carto compute hop distance.
Ok. I didn't know we also had some work on switches in carto.

HTH
This helps!

So I'm now wondering whether the two efforts, which seem similar, are really redundant. We thought about this before starting hitopo, and since a graph didn't fit our needs, we went towards computing an address instead. Perhaps hitopo addresses could be computed from hwloc's graph.
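To illustrate what I mean, here is a minimal sketch assuming hwloc is available on the node; the hitopo_addr_t type and get_local_addr() name are made up for the example (and HWLOC_OBJ_PACKAGE is HWLOC_OBJ_SOCKET on older hwloc releases):

#include <hwloc.h>

/* Hypothetical address type, for this example only. */
typedef struct { int socket; int core; } hitopo_addr_t;

static int get_local_addr(hitopo_addr_t *addr)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = NULL;
    hwloc_obj_t pu, core, pkg;
    int rc = -1;

    if (hwloc_topology_init(&topo) < 0) return -1;
    if (hwloc_topology_load(topo) < 0) goto out;

    /* Where is this process currently bound? */
    set = hwloc_bitmap_alloc();
    if (NULL == set || hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) < 0)
        goto out;

    /* Take the first PU of the binding and walk up hwloc's tree. */
    pu = hwloc_get_obj_inside_cpuset_by_type(topo, set, HWLOC_OBJ_PU, 0);
    if (NULL == pu) goto out;
    core = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu);
    pkg  = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_PACKAGE, pu);

    addr->core   = core ? (int)core->logical_index : -1;
    addr->socket = pkg  ? (int)pkg->logical_index  : -1;
    rc = 0;
out:
    if (set) hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return rc;
}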

I understand that for sm optimization, hwloc is richer. The only thing that bugs me is how much time it takes to figure out what capability I have between processes A and B. The great thing about hitopo is that a single comparison can give you a property of two processes (e.g. that they are on the same socket).
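Purely as a hypothetical illustration (the level ordering below is an assumption, not hitopo's actual layout), such a comparison could be as simple as:

/* Compare two hierarchical addresses level by level to get the deepest
 * level the two processes share, with no graph traversal needed. */
enum { HITOPO_ISLAND, HITOPO_SWITCH, HITOPO_NODE,
       HITOPO_SOCKET, HITOPO_CORE, HITOPO_NLEVELS };

static int deepest_common_level(const int a[HITOPO_NLEVELS],
                                const int b[HITOPO_NLEVELS])
{
    int level, deepest = -1;           /* -1: nothing in common */
    for (level = 0; level < HITOPO_NLEVELS; level++) {
        if (a[level] != b[level]) break;
        deepest = level;
    }
    return deepest;
}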

Anyway, I just wanted to present hitopo in case someone needs it. And I think hitopo's preferred domain remains collectives, where you do not really need distances, but rather groups which share a certain locality.
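For example, just a sketch: once hitopo's renumbering has given each process a node number (the my_node_id variable below is assumed to be that value), building such a group is a single MPI_Comm_split:

#include <mpi.h>

/* Group processes that share a locality level (here: the node) into their
 * own communicator; my_node_id is assumed to come from hitopo's renumbering. */
static MPI_Comm build_locality_comm(MPI_Comm parent, int my_node_id)
{
    int rank;
    MPI_Comm node_comm;

    MPI_Comm_rank(parent, &rank);
    /* Same color => same communicator; rank keeps the original ordering. */
    MPI_Comm_split(parent, my_node_id, rank, &node_comm);
    return node_comm;
}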

Sylvain

On Mon, Nov 15, 2010 at 9:00 AM, Sylvain Jeaugey
<sylvain.jeau...@bull.net> wrote:

I already mentioned it when answering Terry's e-mail, but to be sure I'm clear: don't confuse the full node topology with the MPI job topology. They _are_ different.

And every process does not get the whole topology in hitopo, only its own,
which should not cause storms.


On Mon, 15 Nov 2010, Ralph Castain wrote:

 I think the two efforts (the paffinity one and this one) do overlap somewhat. I've been writing the local topology discovery code for Jeff, Terry, and Josh - it uses hwloc (or any other method - it's a framework) to discover what hardware resources are available on each node in the job so that the info can be used in mapping the procs.

As part of that work, we are passing down to the MPI processes the local hardware topology. This is done because of prior complaints when we had each MPI process discover that info for itself - it creates a bit of a "storm" on the node for large SMPs.

Note that what I've written (still to be completed before coming over) doesn't tell the proc what cores/HTs it is bound to - that's the part Terry et al. are adding. Nor were we discovering the switch topology of the cluster.

So there's a little overlap that could be resolved. And a concern on my part: we have previously introduced capabilities that had every MPI process read local system files to get node topology, and gotten user complaints about it. We probably shouldn't go back to that practice.

Ralph


On Mon, Nov 15, 2010 at 8:15 AM, Terry Dontje <terry.don...@oracle.com> wrote:

  A few comments:

1.  Have you guys considered using hwloc for level 4-7 detection?
2.  Is L2 related to L2 cache?  If not, then is there some other term you could use?
3.  What do you see if the process is bound to multiple cores/hyperthreads?
4.  What do you see if the process is not bound to any level 4-7 items?
5.  What about L1 and L2 cache locality as additional levels? (hwloc exposes these, but they are at different depths depending on the platform.)

Note I am working with Jeff Squyres and Josh Hursey on some new paffinity code that uses hwloc.  Though the paffinity code may not have a direct relationship to hitopo, the use of hwloc and standardization of what you call levels 4-7 might help avoid some user confusion.

--td


On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote:

As a follow-up to the Stuttgart developers' meeting, here is an RFC for our topology detection framework.

WHAT: Add a framework for hardware topology detection to be used by any
other part of Open MPI to help optimization.

WHY: Collective operations or shared memory algorithms, among others, may have optimizations depending on the hardware relationship between two MPI processes. HiTopo is an attempt to provide that information in a unified manner.

WHERE: ompi/mca/hitopo/

WHEN: When wanted.


==========================================================================
We developed the HiTopo framework for our collective operation component, but it may be useful for other parts of Open MPI, so we'd like to contribute it.

A wiki page has been set up:
https://svn.open-mpi.org/trac/ompi/wiki/HiTopo

and a bitbucket repository:
http://bitbucket.org/jeaugeys/hitopo/

In a few words, we have 3 steps in HiTopo:

 - Detection: each MPI process detects its topology at various levels:
   - core/socket: through the cpuid component
   - node: through gethostname
   - switch/island: through openib (mad) or slurm
     [ Other topology detection components may be added for other
       resource managers, specific hardware or whatever we want... ]

 - Collection: an allgather is performed so that every process has all other processes' addresses

 - Renumbering: "string" addresses are converted to numbers starting at 0
   (Example: nodenames "foo" and "bar" are renamed 0 and 1); see the sketch below.
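
As a rough illustration of the last two steps (not the actual hitopo code; the fixed-size ADDR_LEN string addresses are an assumption), the collection and renumbering could look like:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define ADDR_LEN 64   /* assumed fixed size of a string address */

/* Gather everyone's string address, then map each distinct string to a
 * dense number starting at 0, in order of first appearance by rank. */
static int renumber_address(MPI_Comm comm, const char my_addr[ADDR_LEN])
{
    int size, i, j, nuniq = 0, my_id = -1;
    char *all;

    MPI_Comm_size(comm, &size);
    all = malloc((size_t)size * ADDR_LEN);

    /* Collection: allgather of all string addresses. */
    MPI_Allgather((void *)my_addr, ADDR_LEN, MPI_CHAR,
                  all, ADDR_LEN, MPI_CHAR, comm);

    /* Renumbering: the first occurrence of each distinct string gets the
     * next number, e.g. nodenames "foo" and "bar" become 0 and 1. */
    for (i = 0; i < size; i++) {
        for (j = 0; j < i; j++) {
            if (0 == strncmp(all + i * ADDR_LEN, all + j * ADDR_LEN, ADDR_LEN))
                break;
        }
        if (j == i) {                  /* first time we see this address */
            if (0 == strncmp(all + i * ADDR_LEN, my_addr, ADDR_LEN))
                my_id = nuniq;
            nuniq++;
        }
    }
    free(all);
    return my_id;
}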

Any comment welcome,
Sylvain



--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com




_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

