I unfortunately don't have many cycles to think about this before Oct 1, but I'm still a little concerned about the portability aspects of having hwloc be a first-class citizen of OMPI: if we support a platform that hwloc doesn't, that seems like it will still cause problems...
Brian

On Sep 22, 2010, at 7:08 PM, Jeff Squyres wrote:

> WHAT: Make hwloc a 1st class item in OMPI
>
> WHY: At least 2 pieces of new functionality want/need to use the hwloc data
>
> WHERE: Put it in ompi/hwloc
>
> WHEN: Some time in the 1.5 series
>
> TIMEOUT: Tues teleconf, Oct 5 (about 2 weeks from now)
>
> --------------------------------------------------------------------------------
>
> A long time ago, I floated the proposal of putting hwloc at the top level in opal so that parts of OPAL/ORTE/OMPI could use the data directly. I didn't have any concrete suggestions at the time about what exactly would use the hwloc data -- just a feeling that "someone" would want to.
>
> There are now two solid examples of functionality that want to use hwloc data directly:
>
> 1. Sandia + ORNL are working on a proposal for MPI_COMM_SOCKET, MPI_COMM_NUMA_NODE, MPI_COMM_CORE, ...etc. (those names may not be the right ones, but you get the idea). That is, pre-defined communicators that contain all the MPI procs on the same socket as you, the same NUMA node as you, the same core as you, ...etc.
>
> 2. INRIA presented a paper at Euro MPI last week that takes process distance to NICs into account when coming up with the long-message splitting ratio for the PML. E.g., if we have 2 openib NICs with the same bandwidth, don't just assume that we'll split long messages 50-50 across both of them. Instead, use NUMA distances to influence calculating the ratio. See the paper here: http://hal.archives-ouvertes.fr/inria-00486178/en/
>
> A previous objection was that we are increasing our dependencies by making hwloc be a 1st-class entity in OPAL -- we're hosed if hwloc ever goes out of business. Fair enough. But that being said, hwloc is getting a bit of a community growing around it: vendors are submitting patches for their hardware, distros are picking it up, etc. I certainly can't predict the future, but hwloc looks in good shape for now. There is a little risk in depending on hwloc, but I think it's small enough to be ok.
>
> Cisco does need to be able to compile OPAL/ORTE without hwloc, however (for embedded environments where hwloc simply takes up space and adds no value). I previously proposed wrapping a subset of the hwloc API with opal_*() functions. After thinking about that a bit, that seems like a lot of work for little benefit -- how does one decide *which* subset of hwloc should be wrapped?
>
> Instead, it might be worthwhile to simply put hwloc up in ompi/hwloc (instead of opal/hwloc). Indeed, the 2 places that want to use hwloc are up in the MPI layer -- I'm guessing that most functionality that wants hwloc will be up in MPI. And if we do the build system right, we can have paffinity/hwloc and libmpi's hwloc all link against the same libhwloc_embedded so that:
>
> a) there's no duplication in the process, and
> b) paffinity/hwloc can still be compiled out with the usual mechanisms to avoid having hwloc in OPAL/ORTE for embedded environments
>
> (there's a little hand-waving there, but I think we can figure out the details)
>
> We *may* want to refactor paffinity and maffinity someday, but that's not necessarily what I'm proposing here.
>
> Comments?
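
To make use case #1 a bit more concrete: below is a rough sketch (not the actual Sandia/ORNL proposal) of how a socket-scoped communicator could be built today with hwloc plus MPI_Comm_split. The names my_socket_color() and build_socket_comm() are made up for illustration, and the code assumes each process is bound to (part of) a single socket.

/* Rough sketch only: build a communicator of the MPI processes that
 * share the caller's socket. */
#include <mpi.h>
#include <hwloc.h>

static int my_socket_color(void)
{
    hwloc_topology_t topo;
    hwloc_cpuset_t set;
    hwloc_obj_t obj;
    int color = 0;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Find the CPUs this process is currently bound to... */
    set = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);

    /* ...and walk up from the smallest covering object to its socket. */
    obj = hwloc_get_obj_covering_cpuset(topo, set);
    while (obj != NULL && obj->type != HWLOC_OBJ_SOCKET) {
        obj = obj->parent;
    }
    if (obj != NULL) {
        color = (int) obj->logical_index;
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return color;
}

int build_socket_comm(MPI_Comm *socket_comm)
{
    int rank;
    /* Caveat: the socket index is only unique within a node, so a real
     * implementation would also fold a node ID into the color (e.g. via
     * a prior per-node split).  This just shows the shape of the idea. */
    int color = my_socket_color();

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    return MPI_Comm_split(MPI_COMM_WORLD, color, rank, socket_comm);
}

A real MPI_COMM_SOCKET would presumably cache the topology and handle unbound processes; the hwloc lookup is the interesting part.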
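
Similarly, the gist of use case #2 can be shown with a toy weighting function. This is not the cost model from the INRIA paper, just the simplest "bandwidth divided by distance" version of the idea; the struct and field names are made up, and the actual numbers would come from the BTLs and hwloc's distance information.

/* Toy model: weight each NIC's share of a long message by its bandwidth
 * divided by the NUMA distance from the sending process to that NIC,
 * instead of a flat 50/50 split. */
struct nic_info {
    double bandwidth;   /* advertised bandwidth */
    double distance;    /* relative NUMA distance from the process, >= 1.0 */
};

static void compute_split_ratios(const struct nic_info *nics, int n,
                                 double *share)
{
    double total = 0.0;
    int i;

    for (i = 0; i < n; ++i) {
        total += nics[i].bandwidth / nics[i].distance;
    }
    for (i = 0; i < n; ++i) {
        share[i] = (nics[i].bandwidth / nics[i].distance) / total;
    }
}

/* Example: two equal-bandwidth openib NICs, one local (distance 1.0) and
 * one across the NUMA interconnect (distance 2.0), come out to roughly a
 * 67/33 split instead of 50/50. */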
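
For reference, the opal_*() wrapper approach Jeff mentions (and argues against) would amount to one shim like the following per hwloc call that the upper layers want. OPAL_HAVE_HWLOC, opal_hwloc_topology, and opal_hwloc_num_sockets() are hypothetical names here, not existing OPAL symbols.

/* Hypothetical shim showing the shape of the wrapper approach: each
 * hwloc call gets an opal_ wrapper that degrades gracefully when hwloc
 * is configured out. */
#if OPAL_HAVE_HWLOC
#include <hwloc.h>
extern hwloc_topology_t opal_hwloc_topology;

int opal_hwloc_num_sockets(void)
{
    return hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_SOCKET);
}
#else
int opal_hwloc_num_sockets(void)
{
    return -1;   /* hwloc compiled out (e.g., embedded builds) */
}
#endif

Multiplying that across every hwloc accessor, type, and constant the MPI layer might eventually want is exactly the "which subset?" problem.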
>
> --
> Jeff Squyres
> jsquy...@cisco.com

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories