Hello

We have seen _many_ reports like these. But there are different kinds of
errors. As far as I understand:

* Julio's error is caused by the Linux kernel improperly reporting L3
cache affinities. It's specific to multi-socket 12-core processors
because the kernel makes invalid assumptions about core APIC IDs in
these processors (because only 12 out of 16 cores are enabled).
HWLOC_COMPONENTS=x86 was designed to work around this issue until AMD
fixed the kernel, but it looks like that fix never landed.

* Your error looks like another issue where the BIOS reports invalid
NUMA affinity (likely in the SRAT table). A BIOS upgrade may help.
Fortunately, the x86 backend can also read NUMA affinity from CPUID
instructions on AMD. I didn't know (or remember) that
HWLOC_COMPONENTS=x86 could help with this bug too.
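
For anyone wondering what "intersects without inclusion" means in both
messages: the two objects share some CPUs, but neither cpuset contains
the other, so hwloc cannot nest one inside the other. Here is a minimal
shell sketch of that check, using the cpusets quoted in the two reports
below. The function names cpuset_to_int and cpuset_relation are made up
for illustration (they are not hwloc tools), and the sketch assumes the
usual hwloc cpuset string format (comma-separated 32-bit hex words, most
significant first), limited here to two words (64 CPUs).

```shell
#!/bin/sh
# Hypothetical helper: convert an hwloc-style cpuset string to one integer.
# Assumption: at most two comma-separated 32-bit words (up to 64 CPUs).
cpuset_to_int() {
    case "$1" in
        *,*) echo $(( (${1%%,*} << 32) | ${1#*,} )) ;;  # high word, low word
        *)   echo $(( $1 )) ;;                          # single word
    esac
}

# Hypothetical helper: print how two cpusets relate to each other.
cpuset_relation() {
    a=$(cpuset_to_int "$1")
    b=$(cpuset_to_int "$2")
    common=$(( a & b ))   # CPUs present in both sets
    if [ "$common" -eq 0 ]; then
        echo "disjoint"
    elif [ "$common" -eq "$a" ] || [ "$common" -eq "$b" ]; then
        echo "included"   # one set fully contains the other
    else
        echo "intersects without inclusion"
    fi
}

# L3 vs NUMANode P#0 from Julio's report
cpuset_relation 0x000003f0 0x0000003f
# Socket P#2 vs NUMANode P#3 from Mehmet's report
cpuset_relation 0x0000ffff,0x0 0x0000ff00,0xff000000
```

Both calls print "intersects without inclusion", which is the condition
the two error banners complain about.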


I am going to add this workaround to the FAQ about these errors (this
FAQ has been listed in the error message since 1.11).


By the way, you should upgrade. 1.10 is verrrry old :)

Brice




On 30/06/2017 21:59, Belgin, Mehmet wrote:
> We (Georgia Tech) too have been observing this on 16-core AMD AbuDhabi
> machines (6378). We weren’t aware of the HWLOC_COMPONENTS workaround,
> which seems to mitigate the issue.
>
> *Before:*
>
> # ./lstopo
> ****************************************************************************
> * hwloc has encountered what looks like an error from the operating system.
> *
> * Socket (P#2 cpuset 0x0000ffff,0x0) intersects with NUMANode (P#3 cpuset 0x0000ff00,0xff000000) without inclusion!
> * Error occurred in topology.c line 940
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology script.
> ****************************************************************************
> Machine (128GB total)
>   Group0 L#0
>     NUMANode L#0 (P#1 32GB)
> ...
>
> *After:*
>
> # export HWLOC_COMPONENTS=x86
> # ./lstopo
> Machine
>   Socket L#0
>     NUMANode L#0 (P#0) + L3 L#0 (6144KB)
>       L2 L#0 (2048KB) + L1i L#0 (64KB)
> ...
>
> These nodes are the only ones in our entire cluster to cause zombie
> processes using torque/moab. I have a feeling that they are related.
> We use hwloc/1.10.0.
>
> Not sure if this helps at all, but you are definitely not alone :)
>
> Thanks,
> -Mehmet
>
>
>
>> On Jun 29, 2017, at 1:24 AM, Brice Goglin <brice.gog...@inria.fr
>> <mailto:brice.gog...@inria.fr>> wrote:
>>
>> Hello
>>
>> We've seen this issue many times (it's specific to 12-core opterons),
>> but I am surprised it still occurs with such a recent kernel. AMD was
>> supposed to fix the kernel in early 2016 but I forgot to check
>> whether something was actually pushed.
>>
>> Anyway, you can likely ignore the issue as documented in the FAQ
>> https://www.open-mpi.org/projects/hwloc/doc/v1.11.7/a00305.php unless
>> you care about L3 affinity for binding. Otherwise, you can work around
>> the issue by passing HWLOC_COMPONENTS=x86 in the environment so that
>> hwloc uses cpuid instead of Linux sysfs files to discover the topology.
>>
>> Brice
>>
>>
>>
>>
>> On 29/06/2017 02:17, Julio Figueroa wrote:
>>> Hi
>>>
>>> I am experiencing the following issues when using pnetcdf version 1.8.1.
>>> The machine is a Supermicro (H8DGi) dual socket AMD Opteron 6238
>>> (patch_level=0x0600063d)
>>> The BIOS is the latest from Supermicro (v3.5c 03/18/2016)
>>> OS: Debian 9.0 Kernel: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1
>>> (2017-06-18) x86_64 GNU/Linux
>>> ****************************************************************************
>>> * hwloc 1.11.5 has encountered what looks like an error from the operating system.
>>> *
>>> * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
>>> * Error occurred in topology.c line 1074
>>> *
>>> * The following FAQ entry in the hwloc documentation may help:
>>> *   What should I do when hwloc reports "operating system" warnings?
>>> * Otherwise please report this error message to the hwloc user's mailing list,
>>> * along with the output+tarball generated by the hwloc-gather-topology script.
>>> ****************************************************************************
>>>
>>> As suggested by the error message, the hwloc-gather-topology output
>>> is attached.
>>>
>>> Please let me know if you need more information.
>>>
>>> Julio Figueroa
>>> Oceanographer
>>>
>>>
>>>
>>> _______________________________________________
>>> hwloc-users mailing list
>>> hwloc-users@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users