WHAT: Make hwloc a 1st-class item in OMPI

WHY: At least 2 pieces of new functionality want/need to use the hwloc data

WHERE: Put it in ompi/hwloc

WHEN: Some time in the 1.5 series

TIMEOUT: Tues teleconf, Oct 5 (about 2 weeks from now)

--------------------------------------------------------------------------------

A long time ago, I floated the proposal of putting hwloc at the top level in 
OPAL so that parts of OPAL/ORTE/OMPI could use the data directly.  I didn't 
have any concrete suggestions at the time about what exactly would use the 
hwloc data -- just a feeling that "someone" would want to.

There are now two solid examples of functionality that want to use hwloc data 
directly:

1. Sandia + ORNL are working on a proposal for MPI_COMM_SOCKET, 
MPI_COMM_NUMA_NODE, MPI_COMM_CORE, etc. (those names may not be the right 
ones, but you get the idea).  That is, pre-defined communicators that contain 
all the MPI procs on the same socket as you, the same NUMA node as you, the 
same core as you, etc.  (See the first sketch after this list for a rough 
idea of how such a communicator might be built.)

2. INRIA presented a paper at EuroMPI last week that takes each process's 
distance to the NICs into account when computing the long-message splitting 
ratio for the PML.  E.g., if we have 2 openib NICs with the same bandwidth, 
don't just assume that we'll split long messages 50-50 across both of them.  
Instead, use NUMA distances to influence the ratio (the second sketch below 
shows the flavor of the idea).  See the paper here: 
http://hal.archives-ouvertes.fr/inria-00486178/en/
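
To make #1 concrete, here's a minimal sketch of building a per-socket 
communicator with hwloc + MPI_Comm_split().  This is NOT the actual 
Sandia/ORNL proposal -- make_socket_comm() and its overall shape are entirely 
my invention (the hwloc and MPI calls themselves are real):

/*
 * Minimal sketch (not the real proposal): color = index of the socket
 * that contains this process's binding, then MPI_Comm_split() on that
 * color.  Unbound processes fall back to color 0, and a real version
 * would also have to fold a node identifier into the color so that
 * procs on different nodes with the same socket index don't end up in
 * the same communicator.
 */
#include <mpi.h>
#include <hwloc.h>

static MPI_Comm make_socket_comm(void)   /* hypothetical name */
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set;
    hwloc_obj_t s;
    MPI_Comm comm;
    int i, nsockets, color = 0, rank;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    set = hwloc_bitmap_alloc();
    nsockets = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
    if (nsockets > 0 &&
        0 == hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS)) {
        for (i = 0; i < nsockets; ++i) {
            s = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, (unsigned) i);
            if (NULL != s && hwloc_bitmap_isincluded(set, s->cpuset)) {
                color = i;   /* bound within socket i */
                break;
            }
        }
    }
    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &comm);
    return comm;
}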
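
And a toy illustration of #2's locality-biased split.  This is just the 
flavor of the idea, not the algorithm from the paper -- split_ratios() and 
the numbers are made up:

/*
 * Toy version: weight each NIC by bandwidth / distance and normalize.
 * With two equal-bandwidth NICs at relative distances 1.0 and 2.0, the
 * split comes out roughly 67/33 instead of 50/50.
 */
#include <stdio.h>

static void split_ratios(int n, const double bw[], const double dist[],
                         double ratio[])
{
    double sum = 0.0;
    int i;

    for (i = 0; i < n; ++i) {
        ratio[i] = bw[i] / dist[i];  /* closer + faster => bigger share */
        sum += ratio[i];
    }
    for (i = 0; i < n; ++i) {
        ratio[i] /= sum;             /* normalize so the shares sum to 1 */
    }
}

int main(void)
{
    double bw[2]   = { 10.0, 10.0 }; /* same nominal bandwidth */
    double dist[2] = { 1.0, 2.0 };   /* relative NUMA distances */
    double ratio[2];

    split_ratios(2, bw, dist, ratio);
    printf("NIC 0: %.0f%%, NIC 1: %.0f%%\n",
           100.0 * ratio[0], 100.0 * ratio[1]);
    return 0;
}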

A previous objection was that we would be increasing our dependencies by 
making hwloc a 1st-class entity in OPAL -- we're hosed if hwloc ever goes out 
of business.  Fair enough.  But that being said, hwloc has a bit of a 
community growing around it: vendors are submitting patches for their 
hardware, distros are picking it up, etc.  I certainly can't predict the 
future, but hwloc looks to be in good shape for now.  There is a little risk 
in depending on hwloc, but I think it's small enough to be ok.

Cisco does need to be able to compile OPAL/ORTE without hwloc, however (for 
embedded environments where hwloc simply takes up space and adds no value).  I 
previously proposed wrapping a subset of the hwloc API with opal_*() 
functions.  After thinking about that a bit, it seems like a lot of work for 
little benefit -- how does one decide *which* subset of hwloc should be 
wrapped?
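
For reference, the wrapping would have looked something like this -- a purely 
hypothetical sketch (opal_hwloc_get_num_sockets(), OPAL_HAVE_HWLOC, and 
opal_hwloc_topology are all made up), which hopefully shows why it never 
stops:

/*
 * Hypothetical sketch -- NONE of this exists.  Every hwloc entry point
 * we wanted to expose would need a shim like this, plus a stub for
 * hwloc-free embedded builds, and there's no obvious place to stop.
 */
#include "opal_config.h"   /* assume it defines OPAL_HAVE_HWLOC */
#if OPAL_HAVE_HWLOC
#include <hwloc.h>
extern hwloc_topology_t opal_hwloc_topology;  /* hypothetical global */
#endif

int opal_hwloc_get_num_sockets(void)
{
#if OPAL_HAVE_HWLOC
    return hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_SOCKET);
#else
    return -1;   /* embedded build: no topology information */
#endif
}

/* ...and so on for every other call someone might want:
   opal_hwloc_get_num_cores(), opal_hwloc_get_distance(), ... */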

Instead, it might be worthwhile to simply put hwloc up in ompi/hwloc (instead 
of opal/hwloc).  Indeed, the 2 places that want to use hwloc are up in the MPI 
layer -- I'm guessing that most functionality that wants hwloc will be up in 
MPI.  And if we do the build system right, we can have paffinity/hwloc and 
libmpi's hwloc both link against the same libhwloc_embedded so that:

a) there's no duplication in the process, and 
b) paffinity/hwloc can still be compiled out with the usual mechanisms to avoid 
having hwloc in OPAL/ORTE for embedded environments

(there's a little hand-waving there, but I think we can figure out the details)

We *may* want to refactor paffinity and maffinity someday, but that's not 
necessarily what I'm proposing here.

Comments?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

