Hi Jeff, Sorry for the delay, but my victim with 2 IB devices had been stolen ;-)
So, I ported the patch to the v1.5 branch and finally could test it. Actually, there is no opal_hwloc_base_get_topology() in v1.5, so I had to set the hwloc flags in ompi_mpi_init() and orte_odls_base_open() (i.e. the places where opal_hwloc_topology is initialized).

With the new flags set, hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_CORE) now sees the actual number of cores on the node (instead of 1 when our cpuset is a singleton). Since opal_paffinity_base_get_processor_info() calls module_get_processor_info() (in hwloc/paffinity_hwloc_module.c), which in turn calls hwloc_get_nbobjs_by_type(), we now get the right number of cores in get_ib_dev_distance(), so we loop over the exact number of cores when looking for a potential binding.

In conclusion, there is no need for any other patch: the fix you committed was the only one needed to fix the issue. Could you please move it to v1.5 (do I need to file a CMR)? (A minimal standalone sketch of what the flags change is appended after the quoted thread below.)

Thanks!

--
Nadia Derbey

devel-boun...@open-mpi.org wrote on 02/09/2012 06:00:48 PM:

> From: Jeff Squyres <jsquy...@cisco.com>
> To: Open MPI Developers <de...@open-mpi.org>
> Date: 02/09/2012 06:01 PM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-boun...@open-mpi.org
>
> Nadia --
>
> I committed the fix in the trunk to use HWLOC_WHOLE_SYSTEM and IO_DEVICES.
>
> Do you want to revise your patch to use hwloc APIs with
> opal_hwloc_topology (instead of paffinity)? We could use that as a
> basis for the other places you identified that are doing similar things.
>
>
> On Feb 9, 2012, at 8:34 AM, Ralph Castain wrote:
>
> > Ah, okay - in that case, having the I/O device attached to the
> > "closest" object at each depth would be ideal from an OMPI perspective.
> >
> > On Feb 9, 2012, at 6:30 AM, Brice Goglin wrote:
> >
> >> The BIOS usually tells you which NUMA location is close to each
> >> host-to-PCI bridge. So the answer is yes.
> >> Brice
> >>
> >>
> >> Ralph Castain <r...@open-mpi.org> wrote:
> >> I'm not sure I understand this comment. A PCI device is attached
> >> to the node, not to any specific location within the node, isn't it?
> >> Can you really say that a PCI device is "attached" to a specific
> >> NUMA location, for example?
> >>
> >>
> >> On Feb 9, 2012, at 6:15 AM, Jeff Squyres wrote:
> >>
> >>> That doesn't seem too attractive from an OMPI perspective,
> >>> though. We'd want to know where the PCI devices are actually rooted.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
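
For reference, here is a minimal standalone sketch (not the actual v1.5 patch) of why the topology flags matter. The count_cores() helper is purely illustrative, and the flag names are the hwloc 1.x ones in use at the time:

/*
 * Minimal sketch, not the OMPI code itself: load the topology with and
 * without HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM | HWLOC_TOPOLOGY_FLAG_IO_DEVICES
 * and compare the core count reported by hwloc_get_nbobjs_by_type().
 * Build with something like:  gcc count_cores.c -o count_cores -lhwloc
 */
#include <stdio.h>
#include <hwloc.h>

static int count_cores(unsigned long flags)
{
    hwloc_topology_t topo;
    int ncores;

    hwloc_topology_init(&topo);
    if (flags != 0) {
        /* Flags must be set before hwloc_topology_load(). */
        hwloc_topology_set_flags(topo, flags);
    }
    hwloc_topology_load(topo);
    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    hwloc_topology_destroy(topo);
    return ncores;
}

int main(void)
{
    /* Without WHOLE_SYSTEM, a process launched by srun inside a singleton
     * cpuset only sees the core(s) of that cpuset, so the count is 1. */
    printf("cores, default flags:             %d\n", count_cores(0));

    /* With WHOLE_SYSTEM (plus IO_DEVICES so the IB HCAs appear in the
     * tree), hwloc_get_nbobjs_by_type() reports the real number of cores
     * on the node, which is what get_ib_dev_distance() needs when it scans
     * cores for a potential binding. */
    printf("cores, WHOLE_SYSTEM | IO_DEVICES: %d\n",
           count_cores(HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM |
                       HWLOC_TOPOLOGY_FLAG_IO_DEVICES));
    return 0;
}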