Chris -- Can you verify the attached patch? If so, I'll commit it to the SVN trunk and the pending OMPI v1.5 patch.
On Nov 2, 2011, at 10:05 AM, Brice Goglin wrote:

> If we can't find any other way, filtering (during export) would be an
> easy solution.
>
> For the v1.2 branch, the attached patch seems to help. It just prevents
> the creation of internal matrices with invalid relative depth. No
> internal matrix means no XML export, which means you don't break your
> import.
>
> Brice
>
>
> On Nov 2, 2011, at 14:59, Jeff Squyres wrote:
>> Should we just filter out the "distance" attribute in the XML on the
>> v1.2ompi branch? We're not using it (yet) in OMPI.
>>
>> On Nov 2, 2011, at 9:32 AM, Brice Goglin wrote:
>>
>>> Hello,
>>>
>>> The v1.2 branch has known problems with distance matrices when the topology
>>> is asymmetric (especially when Linux cpusets make some NUMA nodes CPU-less).
>>> This is what causes the wrong relative_depth here. It can even be negative in
>>> some cases, which is obviously wrong.
>>>
>>> This should be fixed in v1.3 but it's NOT easy to backport to v1.2. Can you
>>> check that you can export and reimport with v1.3 properly? I will see if I
>>> can find a workaround for v1.2, but it will likely be something like ignoring
>>> distance matrices if reldepth is <= 0.
>>>
>>> In the meantime, you can remove "&& reldepth" from the "if" line below. It
>>> may help.
>>>
>>> Brice
>>>
>>>
>>> On Nov 2, 2011, at 13:42, Jeff Squyres (jsquyres) wrote:
>>>>>> Hi Jeff,
>>>>>>
>>>>>> Brad mentioned you might be able to help me with an OMPI hwloc issue
>>>>>> I'm having.
>>>>>>
>>>>>> It's occurring on a Power 5 RHEL 6.0 machine and is related to the XML
>>>>>> representation of the topology. I've attached the XML to this email.
>>>>>> The problem only occurs on the trunk code.
>>>>>>
>>>>>> The part which appears to be the problem is this:
>>>>>>
>>>>>> <distances nbobjs="4" relative_depth="0" latency_base="10.000000">
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>>   <latency value="1.000000"/>
>>>>>> </distances>
>>>>>>
>>>>>> specifically with relative_depth having a value of 0, but still having
>>>>>> latency children information. In hwloc__xml_import_distances in
>>>>>> topology-xml.c there's a check that assumes there is no latency
>>>>>> information in that case.
>>>>>>
>>>>>> Around line 634 in topology-xml.c:
>>>>>>
>>>>>> if (nbobjs && reldepth && latbase) {
>>>>>>   ... process latency xml nodes
>>>>>> }
>>>>>>
>>>>>> return hwloc__xml_import_close_tag(state);
>>>>>>
>>>>>> The hwloc__xml_import_close_tag function returns a failure because the
>>>>>> latency nodes have not been processed yet.
>>>>>>
>>>>>> I had a look in orted where the XML is created and it does look like
>>>>>> the XML is being assembled correctly as per the topology information it
>>>>>> has retrieved (though I don't know if that itself is correct). The
>>>>>> hwloc__xml_export_object function will quite happily create distance
>>>>>> information if the relative depth is 0 even though
>>>>>> hwloc__xml_import_distances will not be able to parse it.
>>>>>>
>>>>>> So there is at least a problem that the topology code will create XML
>>>>>> that it can't parse, but I don't know enough about the hwloc library to
>>>>>> know if relative depth should always be positive. I suspect it's the
>>>>>> former that is the problem, not the latter, but I don't know for sure...
>>>>>>
>>>>>> If it helps, this is the output of lstopo on the machine:
>>>>>>
>>>>>> cyeoh@p5-40-P4-E0:~$ /home/OpenHPC/hwloc/build/bin/lstopo
>>>>>> Machine (2048MB)
>>>>>>   NUMANode L#0 (P#0 512MB)
>>>>>>     Socket L#0 + L1 L#0 (32KB) + Core L#0
>>>>>>       PU L#0 (P#0)
>>>>>>       PU L#1 (P#1)
>>>>>>     Socket L#1 + L1 L#1 (32KB) + Core L#1
>>>>>>       PU L#2 (P#2)
>>>>>>       PU L#3 (P#3)
>>>>>>   NUMANode L#1 (P#1 640MB)
>>>>>>   NUMANode L#2 (P#2 512MB)
>>>>>>   NUMANode L#3 (P#3 384MB)
>>> _______________________________________________
>>> hwloc-devel mailing list
>>> hwloc-de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>>
>
> <ignore_invalid_reldepth.patch>

-- 
Jeff Squyres
jsquy...@cisco.com
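
For anyone reading this thread later, the export/import mismatch discussed above can be modeled in a few lines of C. This is only a sketch under stated assumptions, not hwloc code: `distances_elem`, `import_distances`, and `keep_distance_matrix` are hypothetical names, and the real importer walks XML state rather than a plain counter. The `strict` flag stands in for the quoted `nbobjs && reldepth && latbase` check, and `keep_distance_matrix` stands in for the "ignore matrices if reldepth is <= 0" idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one <distances> element; this is NOT an hwloc type. */
struct distances_elem {
    unsigned nbobjs;     /* nbobjs attribute */
    int      reldepth;   /* relative_depth attribute */
    float    latbase;    /* latency_base attribute */
    size_t   nb_latency; /* number of <latency> children present in the XML */
};

/*
 * Simplified importer (assumption: a counter replaces the real XML walk).
 * strict != 0 mimics the check quoted in the thread:
 *     nbobjs && reldepth && latbase
 * strict == 0 mimics the interim suggestion of dropping "&& reldepth".
 * Returns 0 on success, or -1 if the child count is wrong or if children
 * would still be unread when the closing tag is expected.
 */
int import_distances(const struct distances_elem *e, int strict)
{
    assert(e != NULL);

    size_t unread = e->nb_latency;
    int process = e->nbobjs != 0 && e->latbase != 0.0f;
    if (strict)
        process = process && e->reldepth != 0;

    if (process) {
        /* A full matrix has nbobjs * nbobjs latency entries. */
        size_t expected = (size_t)e->nbobjs * e->nbobjs;
        if (unread != expected)
            return -1;
        unread = 0; /* all <latency> children consumed */
    }

    /* The close-tag step fails if children were skipped. */
    return unread == 0 ? 0 : -1;
}

/*
 * Export-side guard (hypothetical): only keep a matrix whose relative
 * depth is strictly positive, so no unparsable XML is ever written.
 */
int keep_distance_matrix(int reldepth)
{
    return reldepth > 0;
}
```

With the values from the XML quoted above (nbobjs 4, relative_depth 0, latency_base 10, 16 latency children), `import_distances(&e, 1)` returns -1 and `import_distances(&e, 0)` returns 0, while `keep_distance_matrix(0)` returns 0, meaning such a matrix would never be exported in the first place.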