Somehow Chris' mail didn't make it back to the list (perhaps it got rejected if 
he's not subscribed).

Begin forwarded message:

> From: Christopher Yeoh <cy...@au1.ibm.com>
> Date: November 3, 2011 2:59:34 AM EDT
> To: Jeff Squyres <jsquy...@cisco.com>
> Cc: Hardware locality development list <hwloc-de...@open-mpi.org>, Brad 
> Benton <brad.ben...@us.ibm.com>
> Subject: Re: [hwloc-devel] hwloc problem
> 
> Hi Jeff,
> 
> The patch fixes the crash for me. Thanks Brice!
> 
> Regards,
> 
> Chris
> 
> On Wed, 2 Nov 2011 10:23:32 -0400
> Jeff Squyres <jsquy...@cisco.com> wrote:
> 
>> Chris --
>> 
>> Can you verify the attached patch?  If so, I'll commit it to the SVN
>> trunk and the pending OMPI v1.5 patch.
>> 
>> 
>> On Nov 2, 2011, at 10:05 AM, Brice Goglin wrote:
>> 
>>> If we can't find any other way, filtering (during export) would be
>>> an easy solution.
>>> 
>>> For the v1.2 branch, the attached patch seems to help. It just
>>> prevents the creation of internal matrices with an invalid relative
>>> depth. No internal matrix means no XML export, which means you
>>> don't break your import.
>>> 
>>> Brice
>>> 
>>> 
>>> 
>>> 
>>> Le 02/11/2011 14:59, Jeff Squyres a écrit :
>>>> Should we just filter out the "distance" attribute in the XML on
>>>> the v1.2ompi branch?  We're not using it (yet) in OMPI.
>>>> 
>>>> On Nov 2, 2011, at 9:32 AM, Brice Goglin wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> The v1.2 branch has known problems with distance matrices when
>>>>> the topology is asymmetric (especially when Linux cpusets make
>>>>> some NUMA nodes CPU-less). This is what causes the wrong
>>>>> relative_depth here. It can even be negative in some cases, which
>>>>> is obviously wrong.
>>>>> 
>>>>> This should be fixed in v1.3 but it's NOT easy to backport to
>>>>> v1.2. Can you check that you can export and reimport with v1.3
>>>>> properly? I will see if I can find a workaround for v1.2, but it
>>>>> will likely be something like ignoring distance matrices if
>>>>> reldepth is <= 0.
>>>>> 
>>>>> In the meantime, you can remove "&& reldepth" from the "if" line
>>>>> below. It may help.
>>>>> 
>>>>> Brice
>>>>> 
>>>>> 
>>>>> 
>>>>> Le 02/11/2011 13:42, Jeff Squyres (jsquyres) a écrit :
>>>>>>>> Hi Jeff,
>>>>>>>> 
>>>>>>>> Brad mentioned you might be able to help me with an OMPI hwloc
>>>>>>>> issue I'm having.
>>>>>>>> 
>>>>>>>> It's occurring on a Power 5 RHEL 6.0 machine and is related to
>>>>>>>> the XML representation of the topology. I've attached the XML
>>>>>>>> to this email. The problem only occurs on the trunk code.
>>>>>>>> 
>>>>>>>> The part which appears to be the problem is this:
>>>>>>>> 
>>>>>>>>    <distances nbobjs="4" relative_depth="0" latency_base="10.000000">
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>      <latency value="1.000000"/>
>>>>>>>>    </distances>
>>>>>>>> 
>>>>>>>> specifically with relative_depth having a value of 0, but
>>>>>>>> still having latency children. In
>>>>>>>> hwloc__xml_import_distances in topology-xml.c there's a check
>>>>>>>> that assumes there is no latency information in that case.
>>>>>>>> 
>>>>>>>> Around line 634 in topology-xml.c:
>>>>>>>> 
>>>>>>>> if (nbobjs && reldepth && latbase) {
>>>>>>>>  ... process latency xml nodes
>>>>>>>> }
>>>>>>>> 
>>>>>>>> return hwloc__xml_import_close_tag(state);
>>>>>>>> 
>>>>>>>> The hwloc__xml_import_close_tag function returns a failure
>>>>>>>> because the latency nodes have not been processed yet.
>>>>>>>> 
>>>>>>>> I had a look in orted where the xml is created and it does
>>>>>>>> look like the xml is being assembled correctly as per the
>>>>>>>> topology information it has retrieved (though I don't know if
>>>>>>>> that itself is correct). The hwloc__xml_export_object function
>>>>>>>> will quite happily create distance information if the relative
>>>>>>>> depth is 0 even though hwloc__xml_import_distances will not be
>>>>>>>> able to parse it.
>>>>>>>> 
>>>>>>>> So there is at least a problem that the topology code will
>>>>>>>> create xml that it can't parse, but I don't know enough about
>>>>>>>> the hwloc library to know if relative depth should always be
>>>>>>>> positive. I suspect it's the former that is the problem, not
>>>>>>>> the latter, but I don't know for sure...
>>>>>>>> 
>>>>>>>> If it helps, this is the output of lstopo on the machine:
>>>>>>>> 
>>>>>>>> cyeoh@p5-40-P4-E0:~$ /home/OpenHPC/hwloc/build/bin/lstopo
>>>>>>>> Machine (2048MB)
>>>>>>>> NUMANode L#0 (P#0 512MB)
>>>>>>>>  Socket L#0 + L1 L#0 (32KB) + Core L#0
>>>>>>>>    PU L#0 (P#0)
>>>>>>>>    PU L#1 (P#1)
>>>>>>>>  Socket L#1 + L1 L#1 (32KB) + Core L#1
>>>>>>>>    PU L#2 (P#2)
>>>>>>>>    PU L#3 (P#3)
>>>>>>>> NUMANode L#1 (P#1 640MB)
>>>>>>>> NUMANode L#2 (P#2 512MB)
>>>>>>>> NUMANode L#3 (P#3 384MB)
>>>>> _______________________________________________
>>>>> hwloc-devel mailing list
>>>>> hwloc-de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>>>> 
>>> 
>>> <ignore_invalid_reldepth.patch>
>> 
>> 
> 
> 
> 
> -- 
> cy...@au.ibm.com
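
For anyone reading this in the archive, here is a minimal sketch of the kind of guard discussed above: treat a distance matrix as valid only when its relative depth is strictly positive, and apply the same rule on the export and import sides so the exporter never writes <latency> children that the importer will skip. This is NOT the contents of ignore_invalid_reldepth.patch and is not hwloc's real API; the struct and both function names are hypothetical, made up for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the attributes of a <distances> element.
 * This is not hwloc's real data structure. */
struct distances_attrs {
    unsigned nbobjs;     /* number of objects covered by the matrix */
    int relative_depth;  /* signed: the thread reports 0 and even negative values */
    float latency_base;  /* base latency used to scale the matrix values */
};

/* True when the matrix is worth exporting or importing. Mirrors the
 * workaround suggested in the thread: ignore matrices whose relative
 * depth is <= 0 instead of writing XML that cannot be read back. */
static bool distances_usable(const struct distances_attrs *d)
{
    return d != NULL
        && d->nbobjs > 0
        && d->relative_depth > 0
        && d->latency_base != 0.0f;
}

/* Number of <latency> children to expect for a usable matrix
 * (nbobjs * nbobjs), or 0 when the matrix should be ignored. */
static size_t distances_expected_latencies(const struct distances_attrs *d)
{
    return distances_usable(d) ? (size_t)d->nbobjs * d->nbobjs : 0;
}
```

With the attributes from Chris's XML (nbobjs=4, relative_depth=0, latency_base=10), distances_usable() returns false, so such a matrix would be dropped rather than exported with its 16 latency values.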


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

