Hi Jeff, Sorry for the delay, but my victim with 2 IB devices had been stolen ;-)
So, I ported the patch to the v1.5 branch and finally could test it. Actually, there is no opal_hwloc_base_get_topology() in v1.5, so I had to set the hwloc flags in ompi_mpi_init() and orte_odls_base_open() (i.e. the places where opal_hwloc_topology is initialized).

With the new flags set, hwloc_get_nbobjs_by_type(opal_hwloc_topology, HWLOC_OBJ_CORE) now sees the actual number of cores on the node (instead of 1 when our cpuset is a singleton). Since opal_paffinity_base_get_processor_info() calls module_get_processor_info() (in hwloc/paffinity_hwloc_module.c), which in turn calls hwloc_get_nbobjs_by_type(), we now get the right number of cores in get_ib_dev_distance(), so we loop over the exact number of cores when looking for a potential binding.

In conclusion, there is no need for any other patch: the fix you committed was the only one needed to fix the issue. Could you please move it to v1.5 (do I need to file a CMR)? (A minimal standalone sketch of what the flags change is appended after the quoted thread below.)

Thanks!

--
Nadia Derbey

devel-boun...@open-mpi.org wrote on 02/09/2012 06:00:48 PM:

> From: Jeff Squyres <jsquy...@cisco.com>
> To: Open MPI Developers <de...@open-mpi.org>
> Date: 02/09/2012 06:01 PM
> Subject: Re: [OMPI devel] btl/openib: get_ib_dev_distance doesn't see
> processes as bound if the job has been launched by srun
> Sent by: devel-boun...@open-mpi.org
>
> Nadia --
>
> I committed the fix in the trunk to use HWLOC_WHOLE_SYSTEM and IO_DEVICES.
>
> Do you want to revise your patch to use hwloc APIs with
> opal_hwloc_topology (instead of paffinity)? We could use that as a
> basis for the other places you identified that are doing similar things.
>
>
> On Feb 9, 2012, at 8:34 AM, Ralph Castain wrote:
>
> > Ah, okay - in that case, having the I/O device attached to the
> > "closest" object at each depth would be ideal from an OMPI perspective.
> >
> > On Feb 9, 2012, at 6:30 AM, Brice Goglin wrote:
> >
> >> The BIOS usually tells you which NUMA location is close to each
> >> host-to-PCI bridge. So the answer is yes.
> >> Brice
> >>
> >>
> >> Ralph Castain <r...@open-mpi.org> wrote:
> >> I'm not sure I understand this comment. A PCI device is attached
> >> to the node, not to any specific location within the node, isn't it?
> >> Can you really say that a PCI device is "attached" to a specific
> >> NUMA location, for example?
> >>
> >>
> >> On Feb 9, 2012, at 6:15 AM, Jeff Squyres wrote:
> >>
> >>> That doesn't seem too attractive from an OMPI perspective,
> >>> though. We'd want to know where the PCI devices are actually rooted.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
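
For reference, here is a minimal standalone sketch (not the actual v1.5 patch) of why the topology flags matter. The count_cores() helper is purely illustrative, and the flag names are the hwloc 1.x ones in use at the time:

/*
 * Minimal sketch, not the OMPI code itself: load the topology with and
 * without HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM | HWLOC_TOPOLOGY_FLAG_IO_DEVICES
 * and compare the core count reported by hwloc_get_nbobjs_by_type().
 * Build with something like:  gcc count_cores.c -o count_cores -lhwloc
 */
#include <stdio.h>
#include <hwloc.h>

static int count_cores(unsigned long flags)
{
    hwloc_topology_t topo;
    int ncores;

    hwloc_topology_init(&topo);
    if (flags != 0) {
        /* Flags must be set before hwloc_topology_load(). */
        hwloc_topology_set_flags(topo, flags);
    }
    hwloc_topology_load(topo);
    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    hwloc_topology_destroy(topo);
    return ncores;
}

int main(void)
{
    /* Without WHOLE_SYSTEM, a process launched by srun inside a singleton
     * cpuset only sees the core(s) of that cpuset, so the count is 1. */
    printf("cores, default flags:             %d\n", count_cores(0));

    /* With WHOLE_SYSTEM (plus IO_DEVICES so the IB HCAs appear in the
     * tree), hwloc_get_nbobjs_by_type() reports the real number of cores
     * on the node, which is what get_ib_dev_distance() needs when it scans
     * cores for a potential binding. */
    printf("cores, WHOLE_SYSTEM | IO_DEVICES: %d\n",
           count_cores(HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM |
                       HWLOC_TOPOLOGY_FLAG_IO_DEVICES));
    return 0;
}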