Should we just filter out the "distance" attribute in the XML on the v1.2ompi 
branch?  We're not using it (yet) in OMPI.
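
To make the idea concrete, here is a rough sketch of filtering on the XML text itself, before it reaches the importer. strip_distances() is a hypothetical helper, not an existing hwloc or OMPI function, and it assumes the exporter always writes a paired <distances ...> ... </distances> element as in the sample quoted below (no self-closing or nested form).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of hwloc or OMPI): return a newly
 * allocated copy of xml with every <distances ...>...</distances>
 * element removed. The caller frees the result. Returns NULL on
 * allocation failure or if an opening tag has no matching close tag. */
char *strip_distances(const char *xml)
{
    static const char open_tag[] = "<distances";
    static const char close_tag[] = "</distances>";

    assert(xml != NULL);

    /* The output is never longer than the input. */
    char *out = malloc(strlen(xml) + 1);
    if (out == NULL)
        return NULL;

    char *dst = out;
    const char *src = xml;
    const char *start;

    while ((start = strstr(src, open_tag)) != NULL) {
        const char *end = strstr(start, close_tag);
        if (end == NULL) {
            /* Unbalanced element: refuse to guess. */
            free(out);
            return NULL;
        }
        /* Keep everything before the element, then skip past it. */
        memcpy(dst, src, (size_t)(start - src));
        dst += start - src;
        src = end + strlen(close_tag);
    }

    /* Copy whatever follows the last removed element. */
    strcpy(dst, src);
    return out;
}
```

Text-level filtering like this leaves the rest of the XML untouched, but a real fix would more likely skip the element in the exporter or importer itself.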

On Nov 2, 2011, at 9:32 AM, Brice Goglin wrote:

> Hello,
> 
> The v1.2 branch has known problems with distance matrices when the topology 
> is asymmetric (especially when Linux cpusets make some NUMA nodes CPU-less). 
> This is what causes the wrong relative_depth here. It can even be negative in 
> some cases, which is obviously wrong.
> 
> This should be fixed in v1.3, but it's NOT easy to backport to v1.2. Can you 
> check that you can export and reimport with v1.3 properly? I will see if I 
> can find a workaround for v1.2, but it will likely be something like ignoring 
> distance matrices if reldepth is <= 0.
> 
> In the meantime, you can remove "&& reldepth" from the "if" line below. It 
> may help.
> 
> Brice
> 
> 
> 
> On 02/11/2011 13:42, Jeff Squyres (jsquyres) wrote:
>> 
>>> > Hi Jeff,
>>> >
>>> > Brad mentioned you might be able to help me with an OMPI hwloc issue
>>> > I'm having.
>>> >
>>> > It's occurring on a Power 5 RHEL 6.0 machine and is related to the XML
>>> > representation of the topology. I've attached the XML to this email.
>>> > The problem only occurs on the trunk code.
>>> >
>>> > The part which appears to be the problem is this:
>>> >
>>> >      <distances nbobjs="4" relative_depth="0" latency_base="10.000000">
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >        <latency value="1.000000"/>
>>> >      </distances>
>>> >
>>> > specifically with relative_depth having a value of 0, but still having
>>> > latency children information. In hwloc__xml_import_distances in
>>> > topology-xml.c there's a check that assumes there is no latency
>>> > information.
>>> >
>>> > Around line 634 in topology-xml.c:
>>> >
>>> > if (nbobjs && reldepth && latbase) {
>>> >    ... process latency xml nodes
>>> > }
>>> >
>>> > return hwloc__xml_import_close_tag(state);
>>> >
>>> > The hwloc__xml_import_close_tag function returns a failure because the
>>> > latency nodes have not been processed yet.
>>> >
>>> > I had a look in orted where the xml is created and it does look like
>>> > the xml is being assembled correctly as per the topology information it
>>> > has retrieved (though I don't know if that itself is correct). The
>>> > hwloc__xml_export_object function will quite happily create distance
>>> > information if the relative depth is 0 even though
>>> > hwloc__xml_import_distances will not be able to parse it.
>>> >
>>> > So there is at least a problem that the topology code will create xml
>>> > that it can't parse, but I don't know enough about the hwloc library to
>>> > know if relative depth should always be positive. I suspect it's the
>>> > former which is the problem not the latter, but I don't know for sure...
>>> >
>>> > If it helps, this is the output of lstopo on the machine:
>>> >
>>> > cyeoh@p5-40-P4-E0:~$ /home/OpenHPC/hwloc/build/bin/lstopo
>>> > Machine (2048MB)
>>> >  NUMANode L#0 (P#0 512MB)
>>> >    Socket L#0 + L1 L#0 (32KB) + Core L#0
>>> >      PU L#0 (P#0)
>>> >      PU L#1 (P#1)
>>> >    Socket L#1 + L1 L#1 (32KB) + Core L#1
>>> >      PU L#2 (P#2)
>>> >      PU L#3 (P#3)
>>> >  NUMANode L#1 (P#1 640MB)
>>> >  NUMANode L#2 (P#2 512MB)
>>> >  NUMANode L#3 (P#3 384MB)
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

