Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread fabricio

On 30-06-2017 17:28, Brice Goglin wrote:

On 30/06/2017 22:08, fabricio wrote:

On 30-06-2017 16:21, Brice Goglin wrote:

Yes, it's possible but not very easy. Before we go that way:
Can you also pass HWLOC_COMPONENTS_VERBOSE=1 in the environment and send
the verbose output?


///
Registered cpu discovery component `no_os' with priority 40
(statically build)
Registered global discovery component `xml' with priority 30
(statically build)
Registered global discovery component `synthetic' with priority 30
(statically build)
Registered global discovery component `custom' with priority 30
(statically build)
Registered cpu discovery component `linux' with priority 50
(statically build)
Registered misc discovery component `linuxpci' with priority 19
(statically build)
Registered misc discovery component `pci' with priority 20 (statically
build)
Registered cpu discovery component `x86' with priority 45 (statically
build)
Enabling cpu discovery component `linux'
Enabling cpu discovery component `x86'
Enabling cpu discovery component `no_os'
Excluding global discovery component `xml', conflicts with excludes 0x2
Excluding global discovery component `synthetic', conflicts with
excludes 0x2
Excluding global discovery component `custom', conflicts with excludes
0x2
Enabling misc discovery component `pci'
Enabling misc discovery component `linuxpci'
Final list of enabled discovery components: linux,x86,no_os,pci,linuxpci


* hwloc has encountered what looks like an error from the operating
system.
*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
0x003f) without inclusion!
* Error occurred in topology.c line 942
*
* The following FAQ entry in a recent hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's
mailing list,
* along with the output+tarball generated by the hwloc-gather-topology
script.


Enabling global discovery component `xml'
Excluding cpu discovery component `linux', conflicts with excludes
0x
Excluding cpu discovery component `x86', conflicts with excludes
0x
Excluding cpu discovery component `no_os', conflicts with excludes
0x
Excluding global discovery component `xml', conflicts with excludes
0x
Excluding global discovery component `synthetic', conflicts with
excludes 0x
Excluding global discovery component `custom', conflicts with excludes
0x
Excluding misc discovery component `pci', conflicts with excludes
0x
Excluding misc discovery component `linuxpci', conflicts with excludes
0x
Final list of enabled discovery components: xml
///


I am wondering if the x86 backend was disabled somehow.
Please also send your config.log


I'm using the embedded hwloc in openmpi 1.10.7, whose version seems to
be 1.9.1. I could not find a config.log file.


I thought you were using hwloc 1.11.5? HWLOC_COMPONENTS=x86 can help
there, but not in 1.9.1 from OMPI. Which one did you try?




Setting HWLOC_COMPONENTS=-linux could also work: it totally disables the
Linux backend. If the x86 backend is disabled as well, you would get an
almost empty topology.
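For example, a quick comparison along these lines shows the effect (the
output shape is machine-dependent; this is only a sketch):

  # blacklist the Linux backend; discovery falls back to x86 (CPUID)
  $ HWLOC_COMPONENTS=-linux lstopo
  # blacklist both backends; only an almost empty topology remains
  $ HWLOC_COMPONENTS=-linux,-x86 lstopo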


Will this leave the process allocation to the kernel, potentially
diminishing performance?


This would basically ignore all topology information.
But it's not needed anymore here since the x86 backend is enabled above.

What you can do is one of these:
* tell OMPI to use an external hwloc >= 1.11.2
* use a more recent OMPI :)
* use a XML generated with hwloc >= 1.11.2 with HWLOC_COMPONENTS=x86,
and pass it to OMPI and/or hwloc with HWLOC_XMLFILE=/path/to/foo.xml and
HWLOC_THISSYSTEM=1 in the environment. If it doesn't work, I'll generate
the XML
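As a sketch of that last option (paths and the application name are
hypothetical, and this assumes an hwloc >= 1.11.2 install is first in $PATH):

  # export a corrected topology on the affected machine
  $ HWLOC_COMPONENTS=x86 lstopo /path/to/foo.xml
  # have OMPI/hwloc load it instead of doing native discovery
  $ export HWLOC_XMLFILE=/path/to/foo.xml
  $ export HWLOC_THISSYSTEM=1
  $ mpirun -np 4 ./myapp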


Updating hwloc to 1.11.7, recompiling openmpi, and setting
HWLOC_COMPONENTS=x86 made the error message disappear.
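For reference, a build recipe along these lines should achieve that
(prefixes are hypothetical; --with-hwloc is the standard OMPI configure
option for pointing at an external install):

  $ cd hwloc-1.11.7
  $ ./configure --prefix=$HOME/opt/hwloc && make && make install
  $ cd ../openmpi-1.10.7
  $ ./configure --with-hwloc=$HOME/opt/hwloc && make && make install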


Thanks for the attention!


Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread Brice Goglin
Hello

We have seen _many_ reports like these. But there are different kinds of
errors. As far as I understand:

* Julio's error is caused by the Linux kernel improperly reporting L3
cache affinities. It's specific to multi-socket 12-core processors
because the kernel makes invalid assumptions about core APIC IDs in
these processors (because only 12 out of 16 cores are enabled).
HWLOC_COMPONENTS=x86 was designed to solve this issue until AMD fixed
the kernel, but it looks like they didn't.

* Your error looks like another issue where the BIOS reports invalid
NUMA affinity (likely in the SRAT table). A BIOS upgrade may help.
Fortunately, the x86 backend can also read NUMA affinity from CPUID
instructions on AMD. I didn't know/remember HWLOC_COMPONENTS=x86 could
help for this bug too.


I am going to add this workaround to the FAQ about these errors (this
FAQ has been listed in the error message since 1.11).


By the way, you should upgrade. 1.10 is very old :)

Brice




On 30/06/2017 21:59, Belgin, Mehmet wrote:
> We (Georgia Tech) too have been observing this on 16-core AMD Abu Dhabi
> machines (6378). We weren’t aware of the HWLOC_COMPONENTS workaround,
> which seems to mitigate the issue.
>
> *Before:*
>
> # ./lstopo
> 
> * hwloc has encountered what looks like an error from the operating
> system.
> *
> * Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3
> cpuset 0xff00,0xff00) without inclusion!
> * Error occurred in topology.c line 940
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology
> script.
> 
> Machine (128GB total)
>   Group0 L#0
> NUMANode L#0 (P#1 32GB)
> ...
>
> *After:*
>
> # export HWLOC_COMPONENTS=x86
> # ./lstopo
> Machine
>   Socket L#0
> NUMANode L#0 (P#0) + L3 L#0 (6144KB)
>   L2 L#0 (2048KB) + L1i L#0 (64KB)
> ...
>
> These nodes are the only ones in our entire cluster to cause zombie
> processes using torque/moab. I have a feeling that they are related.
> We use hwloc/1.10.0.
>
> Not sure if this helps at all, but you are definitely not alone :)
>
> Thanks,
> -Mehmet
>
>
>
>> On Jun 29, 2017, at 1:24 AM, Brice Goglin wrote:
>>
>> Hello
>>
>> We've seen this issue many times (it's specific to 12-core Opterons),
>> but I am surprised it still occurs with such a recent kernel. AMD was
>> supposed to fix the kernel in early 2016 but I forgot to check
>> whether something was actually pushed.
>>
>> Anyway, you can likely ignore the issue as documented in the FAQ
>> https://www.open-mpi.org/projects/hwloc/doc/v1.11.7/a00305.php unless
>> you care about L3 affinity for binding. Otherwise, you can work around
>> the issue by passing HWLOC_COMPONENTS=x86 in the environment so that
>> hwloc uses CPUID instead of Linux sysfs files to discover the topology.
>>
>> Brice
>>
>>
>>
>>
>> On 29/06/2017 02:17, Julio Figueroa wrote:
>>> Hi
>>>
>>> I am experiencing the following issue when using pnetcdf version 1.8.1
>>> The machine is a Supermicro (H8DGi) dual socket AMD Opteron 6238
>>> (patch_level=0x0600063d)
>>> The BIOS is the latest from Supermicro (v3.5c 03/18/2016)
>>> OS: Debian 9.0 Kernel: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1
>>> (2017-06-18) x86_64 GNU/Linux
>>> 
>>> * hwloc 1.11.5 has encountered what looks like an error from the
>>> operating system.
>>> *
>>> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
>>> 0x003f) without inclusion!
>>> * Error occurred in topology.c line 1074
>>> *
>>> * The following FAQ entry in the hwloc documentation may help:
>>> *   What should I do when hwloc reports "operating system" warnings?
>>> * Otherwise please report this error message to the hwloc user's
>>> mailing list,
>>> * along with the output+tarball generated by the
>>> hwloc-gather-topology script.
>>> 
>>>
>>> As suggested by the error message, here is the hwloc-gather-topology
>>> attached.
>>>
>>> Please let me know if you need more information.
>>>
>>> Julio Figueroa
>>> Oceanographer
>>>
>>>
>>>


Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread Brice Goglin
On 30/06/2017 22:08, fabricio wrote:
> On 30-06-2017 16:21, Brice Goglin wrote:
>> Yes, it's possible but not very easy. Before we go that way:
>> Can you also pass HWLOC_COMPONENTS_VERBOSE=1 in the environment and send
>> the verbose output?
>
> ///
> Registered cpu discovery component `no_os' with priority 40
> (statically build)
> Registered global discovery component `xml' with priority 30
> (statically build)
> Registered global discovery component `synthetic' with priority 30
> (statically build)
> Registered global discovery component `custom' with priority 30
> (statically build)
> Registered cpu discovery component `linux' with priority 50
> (statically build)
> Registered misc discovery component `linuxpci' with priority 19
> (statically build)
> Registered misc discovery component `pci' with priority 20 (statically
> build)
> Registered cpu discovery component `x86' with priority 45 (statically
> build)
> Enabling cpu discovery component `linux'
> Enabling cpu discovery component `x86'
> Enabling cpu discovery component `no_os'
> Excluding global discovery component `xml', conflicts with excludes 0x2
> Excluding global discovery component `synthetic', conflicts with
> excludes 0x2
> Excluding global discovery component `custom', conflicts with excludes
> 0x2
> Enabling misc discovery component `pci'
> Enabling misc discovery component `linuxpci'
> Final list of enabled discovery components: linux,x86,no_os,pci,linuxpci
> 
>
> * hwloc has encountered what looks like an error from the operating
> system.
> *
> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
> 0x003f) without inclusion!
> * Error occurred in topology.c line 942
> *
> * The following FAQ entry in a recent hwloc documentation may help:
> *   What should I do when hwloc reports "operating system" warnings?
> * Otherwise please report this error message to the hwloc user's
> mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology
> script.
> 
>
> Enabling global discovery component `xml'
> Excluding cpu discovery component `linux', conflicts with excludes
> 0x
> Excluding cpu discovery component `x86', conflicts with excludes
> 0x
> Excluding cpu discovery component `no_os', conflicts with excludes
> 0x
> Excluding global discovery component `xml', conflicts with excludes
> 0x
> Excluding global discovery component `synthetic', conflicts with
> excludes 0x
> Excluding global discovery component `custom', conflicts with excludes
> 0x
> Excluding misc discovery component `pci', conflicts with excludes
> 0x
> Excluding misc discovery component `linuxpci', conflicts with excludes
> 0x
> Final list of enabled discovery components: xml
> ///
>
>> I am wondering if the x86 backend was disabled somehow.
>> Please also send your config.log
>
> I'm using the embedded hwloc in openmpi 1.10.7, whose version seems to
> be 1.9.1. I could not find a config.log file.

I thought you were using hwloc 1.11.5? HWLOC_COMPONENTS=x86 can help
there, but not in 1.9.1 from OMPI. Which one did you try?

>
>> Setting HWLOC_COMPONENTS=-linux could also work: it totally disables the
>> Linux backend. If the x86 backend is disabled as well, you would get an
>> almost empty topology.
>
> Will this leave the process allocation to the kernel, potentially
> diminishing performance?

This would basically ignore all topology information.
But it's not needed anymore here since the x86 backend is enabled above.

What you can do is one of these:
* tell OMPI to use an external hwloc >= 1.11.2
* use a more recent OMPI :)
* use a XML generated with hwloc >= 1.11.2 with HWLOC_COMPONENTS=x86,
and pass it to OMPI and/or hwloc with HWLOC_XMLFILE=/path/to/foo.xml and
HWLOC_THISSYSTEM=1 in the environment. If it doesn't work, I'll generate
the XML

Brice



Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread Belgin, Mehmet
We (Georgia Tech) too have been observing this on 16-core AMD Abu Dhabi machines
(6378). We weren’t aware of the HWLOC_COMPONENTS workaround, which seems to
mitigate the issue.

Before:

# ./lstopo

* hwloc has encountered what looks like an error from the operating system.
*
* Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3 cpuset 
0xff00,0xff00) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.

Machine (128GB total)
  Group0 L#0
NUMANode L#0 (P#1 32GB)
...

After:

# export HWLOC_COMPONENTS=x86
# ./lstopo
Machine
  Socket L#0
NUMANode L#0 (P#0) + L3 L#0 (6144KB)
  L2 L#0 (2048KB) + L1i L#0 (64KB)
...

These nodes are the only ones in our entire cluster to cause zombie processes
using torque/moab. I have a feeling that they are related. We use hwloc/1.10.0.

Not sure if this helps at all, but you are definitely not alone :)

Thanks,
-Mehmet



On Jun 29, 2017, at 1:24 AM, Brice Goglin wrote:

Hello

We've seen this issue many times (it's specific to 12-core Opterons), but I am
surprised it still occurs with such a recent kernel. AMD was supposed to fix
the kernel in early 2016 but I forgot to check whether something was actually
pushed.

Anyway, you can likely ignore the issue as documented in the FAQ
https://www.open-mpi.org/projects/hwloc/doc/v1.11.7/a00305.php unless you care
about L3 affinity for binding. Otherwise, you can work around the issue by
passing HWLOC_COMPONENTS=x86 in the environment so that hwloc uses CPUID
instead of Linux sysfs files to discover the topology.

Brice




On 29/06/2017 02:17, Julio Figueroa wrote:
Hi

I am experiencing the following issue when using pnetcdf version 1.8.1
The machine is a Supermicro (H8DGi) dual socket AMD Opteron 6238
(patch_level=0x0600063d)
The BIOS is the latest from Supermicro (v3.5c 03/18/2016)
OS: Debian 9.0 Kernel: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1 (2017-06-18) 
x86_64 GNU/Linux

* hwloc 1.11.5 has encountered what looks like an error from the operating 
system.
*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset 0x003f) 
without inclusion!
* Error occurred in topology.c line 1074
*
* The following FAQ entry in the hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.


As suggested by the error message, here is the hwloc-gather-topology
attached.
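For reference, the script is typically invoked with an output prefix
(name hypothetical) and produces the two files meant to be sent together:

  $ hwloc-gather-topology ./myhost
  # -> ./myhost.output (lstopo output) and ./myhost.tar.bz2
  #    (the /proc and /sys files used for discovery)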

Please let me know if you need more information.

Julio Figueroa
Oceanographer





Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-30 Thread fabricio

On 29-06-2017 02:24, Brice Goglin wrote:

Hello Brice

I'm still seeing this error message even when passing the 
HWLOC_COMPONENTS=x86 variable.

Is it possible to generate an XML file that can silence this error?


TIA,
Fabricio


Re: [hwloc-users] hwloc error in SuperMicro AMD Opteron 6238

2017-06-28 Thread Brice Goglin
Hello

We've seen this issue many times (it's specific to 12-core Opterons),
but I am surprised it still occurs with such a recent kernel. AMD was
supposed to fix the kernel in early 2016 but I forgot to check whether
something was actually pushed.

Anyway, you can likely ignore the issue as documented in the FAQ
https://www.open-mpi.org/projects/hwloc/doc/v1.11.7/a00305.php unless
you care about L3 affinity for binding. Otherwise, you can work around
the issue by passing HWLOC_COMPONENTS=x86 in the environment so that
hwloc uses CPUID instead of Linux sysfs files to discover the topology.
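A minimal way to apply the workaround (application name hypothetical):

  # for a single tool
  $ HWLOC_COMPONENTS=x86 lstopo
  # for an MPI run, export it; with Open MPI, -x forwards it to remote nodes
  $ export HWLOC_COMPONENTS=x86
  $ mpirun -x HWLOC_COMPONENTS -np 48 ./myapp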

Brice




On 29/06/2017 02:17, Julio Figueroa wrote:
> Hi
>
> I am experiencing the following issue when using pnetcdf version 1.8.1
> The machine is a Supermicro (H8DGi) dual socket AMD Opteron 6238
> (patch_level=0x0600063d)
> The BIOS is the latest from Supermicro (v3.5c 03/18/2016)
> OS: Debian 9.0 Kernel: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1
> (2017-06-18) x86_64 GNU/Linux
> 
> * hwloc 1.11.5 has encountered what looks like an error from the
> operating system.
> *
> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
> 0x003f) without inclusion!
> * Error occurred in topology.c line 1074
> *
> * The following FAQ entry in the hwloc documentation may help:
> *   What should I do when hwloc reports "operating system" warnings?
> * Otherwise please report this error message to the hwloc user's
> mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology
> script.
> 
>
> As suggested by the error message, here is the hwloc-gather-topology
> attached.
>
> Please let me know if you need more information.
>
> Julio Figueroa
> Oceanographer
>
>
>

Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-29 Thread Brice Goglin
On 28/10/2015 18:04, Fabian Wein wrote:
> I hope I'm still on the right list for my current problem.

Hello
It looks like this should go to us...@open-mpi.org now.

> -
> A request was made to bind a process, but at least one node does NOT
> support binding processes to cpus.
>
>   Node:  leo
> This usually is due to not having libnumactl and libnumactl-devel
> installed on the node.
> -
>
> I cannot find these packages for ubuntu 14.04.
>
> I find a lot of ubuntu deb packages on
> https://launchpad.net/ubuntu/+source/numactl
> But there I only find libnuma but no libnumactl.
>
> Where do I get the libnumactl and libnumactl-devel from?

On Deb-based distros (Debian, Ubuntu, etc.), the right package name is
"libnuma-dev". The OMPI message only cares about RPM distros.

> Is this the wrong thread and the wrong list?

Yeah, OpenMPI-specific issues should go to the OpenMPI list (hwloc is a
subproject of the OpenMPI consortium, but the software projects are
pretty much independent).

Brice


> I have a feeling that I'm quite close but just cannot reach it :(
>
> Thanks,
>
> Fabian
>
>
> On 10/27/2015 04:05 PM, Brice Goglin wrote:
>> I guess the next step would be to look at how these tasks are placed on
>> the machine. There are 8 NUMA nodes on the machine. Maybe 9 is where it
>> starts placing a second task per NUMA node?
>> For OMPI, --report-bindings may help. I am not sure about MPICH.
>>
>> Brice
>>
>>
>>
>> On 27/10/2015 15:52, Fabian Wein wrote:
>>> On 10/27/2015 03:42 PM, Brice Goglin wrote:
 I guess the problem is that your OMPI uses an old hwloc internally.
 That
 one may be too old to understand recent XML exports.
 Try replacing "Package" with "Socket" everywhere in the XML file.
>>>
>>> Thanks! That was it.
>>>
>>> I now get almost perfectly reproducible results.
>>>
>>> np  speedup
>>> 1 1.0
>>> 2 1.99
>>> 3 2.98
>>> 4 3.98
>>> 5 4.89
>>> 6 5.9
>>> 7 6.89
>>> 8 7.87
>>> 9 5.44
>>> 10 6.04
>>> 11 6.55
>>> 12 7.0
>>> 13 7.75
>>> 14 8.24
>>> 15 8.41
>>> 16 9.4
>>> 17 7.33
>>> 18 7.16
>>> 19 8.05
>>> 20 8.39
>>>
>>> What still puzzles me is the almost perfect speedup up to eight and
>>> then the drop. But for a start, 8 is already good!
>>>
>>> Thanks again,
>>>
>>> Fabian
>>>



Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-28 Thread Fabian Wein

I hope I'm still on the right list for my current problem.

Today we figured out on a similar but older four-Opteron (6100) 48-core
system that

mpiexec -bind-to numa is the essential point.

I want to do the same on my system. I already installed libnuma so that
hwloc's configure picks up NUMA support.

Then I configured openmpi-1.10.0, which also uses libnuma.

When I compile my petsc example with MPIEXEC="orterun -bind-to numa"
and run the application, I get
-
A request was made to bind a process, but at least one node does NOT
support binding processes to cpus.

  Node:  leo
This usually is due to not having libnumactl and libnumactl-devel
installed on the node.
-

I cannot find these packages for ubuntu 14.04.

Even when I compile numactl-2.0.9 from
http://oss.sgi.com/projects/libnuma/
it only generates libnuma.

I find a lot of ubuntu deb packages on
https://launchpad.net/ubuntu/+source/numactl
But there I only find libnuma, no libnumactl.

Where do I get libnumactl and libnumactl-devel from?

Is this the wrong thread and the wrong list?

I have a feeling that I'm quite close but just cannot reach it :(

Thanks,

Fabian


On 10/27/2015 04:05 PM, Brice Goglin wrote:

I guess the next step would be to look at how these tasks are placed on
the machine. There are 8 NUMA nodes on the machine. Maybe 9 is where it
starts placing a second task per NUMA node?
For OMPI, --report-bindings may help. I am not sure about MPICH.

Brice



On 27/10/2015 15:52, Fabian Wein wrote:

On 10/27/2015 03:42 PM, Brice Goglin wrote:

I guess the problem is that your OMPI uses an old hwloc internally. That
one may be too old to understand recent XML exports.
Try replacing "Package" with "Socket" everywhere in the XML file.


Thanks! That was it.

I now get almost perfectly reproducible results.

np  speedup
1 1.0
2 1.99
3 2.98
4 3.98
5 4.89
6 5.9
7 6.89
8 7.87
9 5.44
10 6.04
11 6.55
12 7.0
13 7.75
14 8.24
15 8.41
16 9.4
17 7.33
18 7.16
19 8.05
20 8.39

What still puzzles me is the almost perfect speedup up to eight and
then the drop. But for a start, 8 is already good!

Thanks again,

Fabian




Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-27 Thread Brice Goglin
I guess the next step would be to look at how these tasks are placed on
the machine. There are 8 NUMA nodes on the machine. Maybe 9 is where it
starts placing a second task per NUMA node?
For OMPI, --report-bindings may help. I am not sure about MPICH.
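For example (process count and binary hypothetical), each rank then prints
one binding line at startup:

  $ mpirun --report-bindings -np 9 ./ex2
  # e.g. "MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.]..."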

Brice



On 27/10/2015 15:52, Fabian Wein wrote:
> On 10/27/2015 03:42 PM, Brice Goglin wrote:
>> I guess the problem is that your OMPI uses an old hwloc internally. That
>> one may be too old to understand recent XML exports.
>> Try replacing "Package" with "Socket" everywhere in the XML file.
>
> Thanks! That was it.
>
> I now get almost perfectly reproducible results.
>
> np  speedup
> 1 1.0
> 2 1.99
> 3 2.98
> 4 3.98
> 5 4.89
> 6 5.9
> 7 6.89
> 8 7.87
> 9 5.44
> 10 6.04
> 11 6.55
> 12 7.0
> 13 7.75
> 14 8.24
> 15 8.41
> 16 9.4
> 17 7.33
> 18 7.16
> 19 8.05
> 20 8.39
>
> What still puzzles me is the almost perfect speedup up to eight and
> then the drop. But for a start, 8 is already good!
>
> Thanks again,
>
> Fabian
>



Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-27 Thread Fabian Wein

On 10/27/2015 03:42 PM, Brice Goglin wrote:

I guess the problem is that your OMPI uses an old hwloc internally. That
one may be too old to understand recent XML exports.
Try replacing "Package" with "Socket" everywhere in the XML file.


Thanks! That was it.

I now get almost perfectly reproducible results.

np  speedup
1 1.0
2 1.99
3 2.98
4 3.98
5 4.89
6 5.9
7 6.89
8 7.87
9 5.44
10 6.04
11 6.55
12 7.0
13 7.75
14 8.24
15 8.41
16 9.4
17 7.33
18 7.16
19 8.05
20 8.39

What still puzzles me is the almost perfect speedup up to eight and
then the drop. But for a start, 8 is already good!

Thanks again,

Fabian



Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-27 Thread Brice Goglin
I guess the problem is that your OMPI uses an old hwloc internally. That
one may be too old to understand recent XML exports.
Try replacing "Package" with "Socket" everywhere in the XML file.
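A one-liner along these lines does the replacement (filename hypothetical;
-i.bak keeps a backup of the original export):

  $ sed -i.bak 's/Package/Socket/g' leo.xml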

Brice



On 27/10/2015 15:31, Fabian Wein wrote:
> Thank you very much for the file.
>
> When I try with PETSc, compiled with open-mpi and icc I get
>
> --
> Failed to parse XML input with the minimalistic parser. If it was not
> generated by hwloc, try enabling full XML support with libxml2.
> --
>
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   topology discovery failed
>   --> Returned value Not supported (-8) instead of ORTE_SUCCESS
> ---
>
> Without export HWLOC_XMLFILE
>
> I get the well known
>
> * hwloc has encountered what looks like an error from the operating
> system.
> *
> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
> 0x003f) without inclusion!
> * Error occurred in topology.c line 942
> *
> * The following FAQ entry in a recent hwloc documentation may help:
> *   What should I do when hwloc reports "operating system" warnings?
> * Otherwise please report this error message to the hwloc user's
> mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology
> script.
>
> And the poor scaling
>
> Triad:55372.8884   Rate (MB/s)
> 
> np  speedup
> 1 1.0
> 2 1.03
> 3 2.98
> 4 3.98
> 5 4.95
> 6 5.96
> 7 4.15
> 8 4.73
> 9 5.36
> 10 5.94
> 11 4.79
> 12 5.25
>
> which is very random upon repetition but never better than a maximal
> speedup of 7.
> I have 24 (48) cores and only one was in use at the time by another
> process.
>
> Using mpich instead of open-mpi I get no message about the hwloc issue
> but
> the same poor and random speedups.
>
> I tried to check the xml file by myself via
> xmllint --valid leo_brice.xml  --loaddtd /usr/local/share/hwloc/hwloc.dtd
>
> However xmllint complains about hwloc.dtd itself
> /usr/local/share/hwloc/hwloc.dtd:8: parser error : StartTag: invalid
> element name
> 
>
> I have to mention that I have a mixture of hwloc versions: the most recent
> installed locally and an older one as part of petsc.
>
> Any ideas?
>
> Thanks,
>
> Fabian
>
>
>
> On 10/27/2015 10:21 AM, Brice Goglin wrote:
>> Here's the fixed XML. For the record, for each NUMA node, I extended
>> the cpusets of the L3 to match the container NUMA node, and moved all
>> L2 objects as children of that L3.
>> Now you may load that XML instead of the native discovery by setting
>> HWLOC_XMLFILE=leo2.xml in your environment.
>> Brice
>>
>>
>>
>> On 27/10/2015 10:08, Fabian Wein wrote:
>>> Brice,
>>>
>>> thank you very much for the offer. I attached the xml file
>>> ..
>>>
>>> * hwloc 1.11.1 has encountered what looks like an error from the
>>> operating system.
>>> *
>>> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
>>> 0x003f) without inclusion!
>>> * Error occurred in topology.c line 981
>>> *
>>> ..
>>>
>>> So if you can afford the time, I appreciate it very much!
>>>
>>> Fabian
>>>
>>>
>>>
>>> On 10/27/2015 09:52 AM, Brice Goglin wrote:
 Hello

 This bug is about L3 cache locality only, everything else should be
 fine, including cache sizes. Few applications use that locality
 information, so I assume it doesn't matter for PETSc scaling.
 We can work around the bug by loading a XML topology. There's no easy
 way to build that correct XML, but I can do it manually if you send
 your
 current broken topology (lstopo foo.xml and send this foo.xml).

 Brice



On 27/10/2015 09:43, Fabian Wein wrote:
> Hello,
>
> I'm new to the list and new to the mpi-business, too.
>
> Our 4*12 Opteron 6238 system is very similar to the one from the
> original poster and I get the same error message.
> Any use in posting my logs?
>
> I compiled the latest hwloc, no change. Our system is Ubuntu 14.04 LTS
> with kernel 3.13, and our BIOS is not updated.
>
> The system scales very fine with OpenMP but fails to give any
> realistic scaling using PETSc (both for the standard
> streaming benchmark and quick tests with a given application).
>
> As far as I understand the system is fine, just the information
> gathering fails, right?!
>
> Do you know if the hwloc issue relates with our poor PETSc
> scaling? Is
> there a way to configure the topology
> manually?
>
> To me it appears that a BIOS update wouldn't help, right?! I wouldn't
> try it if it is not necessary.

Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-27 Thread Fabian Wein

Thank you very much for the file.

When I try with PETSc, compiled with open-mpi and icc I get

--
Failed to parse XML input with the minimalistic parser. If it was not
generated by hwloc, try enabling full XML support with libxml2.
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  topology discovery failed
  --> Returned value Not supported (-8) instead of ORTE_SUCCESS
---

Without export HWLOC_XMLFILE

I get the well known

* hwloc has encountered what looks like an error from the operating
system.
*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
0x003f) without inclusion!
* Error occurred in topology.c line 942
*
* The following FAQ entry in a recent hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's
mailing list,
* along with the output+tarball generated by the hwloc-gather-topology
script.


And the poor scaling

Triad:55372.8884   Rate (MB/s)

np  speedup
1 1.0
2 1.03
3 2.98
4 3.98
5 4.95
6 5.96
7 4.15
8 4.73
9 5.36
10 5.94
11 4.79
12 5.25

which is very random upon repetition but never better than a maximal
speedup of 7.
I have 24 (48) cores and only one was in use at the time by another
process.


Using mpich instead of open-mpi I get no message about the hwloc issue but
the same poor and random speedups.

I tried to check the xml file by myself via
xmllint --valid leo_brice.xml  --loaddtd /usr/local/share/hwloc/hwloc.dtd

However xmllint complains about hwloc.dtd itself
/usr/local/share/hwloc/hwloc.dtd:8: parser error : StartTag: invalid 
element name



I have to mention that I have a mixture of hwloc versions: the most recent
installed locally and an older one as part of petsc.

Any ideas?

Thanks,

Fabian



On 10/27/2015 10:21 AM, Brice Goglin wrote:

Here's the fixed XML. For the record, for each NUMA node, I extended
the cpusets of the L3 to match the container NUMA node, and moved all
L2 objects as children of that L3.
Now you may load that XML instead of the native discovery by setting
HWLOC_XMLFILE=leo2.xml in your environment.
Brice



On 27/10/2015 10:08, Fabian Wein wrote:

Brice,

thank you very much for the offer. I attached the xml file
..

* hwloc 1.11.1 has encountered what looks like an error from the
operating system.
*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
0x003f) without inclusion!
* Error occurred in topology.c line 981
*
..

So if you can afford the time, I appreciate it very much!

Fabian



On 10/27/2015 09:52 AM, Brice Goglin wrote:

Hello

This bug is about L3 cache locality only, everything else should be
fine, including cache sizes. Few applications use that locality
information, so I assume it doesn't matter for PETSc scaling.
We can work around the bug by loading a XML topology. There's no easy
way to build that correct XML, but I can do it manually if you send
your
current broken topology (lstopo foo.xml and send this foo.xml).

Brice



On 27/10/2015 09:43, Fabian Wein wrote:

Hello,

I'm new to the list and new to the mpi-business, too.

Our 4*12 Opteron 6238 system is very similar to the one from the
original poster and I get the same error message.
Any use in posting my logs?

I compiled the latest hwloc, no change. Our system is Ubuntu 14.04 LTS
with kernel 3.13, and our BIOS is not updated.

The system scales very fine with OpenMP but fails to give any
realistic scaling using PETSc (both for the standard
streaming benchmark and quick tests with a given application).

As far as I understand the system is fine, just the information
gathering fails, right?!

Do you know if the hwloc issue relates with our poor PETSc
scaling? Is
there a way to configure the topology
manually?

To me it appears that a BIOS update wouldn't help, right?! I wouldn't
try it if it is not necessary. I'm a user with sudo access,
not an administrator, but we have no admin for the system.

Thanks,

Fabian

Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-27 Thread Brice Goglin
Here's the fixed XML. For the record, for each NUMA node, I extended the
cpusets of the L3 to match the container NUMA node, and moved all L2
objects as children of that L3.
Now you may load that XML instead of the native discovery by setting
HWLOC_XMLFILE=leo2.xml in your environment.
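A quick way to check that the file is picked up (path hypothetical;
HWLOC_THISSYSTEM=1 additionally asserts that the XML describes the local
machine, which matters if you bind processes):

  $ export HWLOC_XMLFILE=/path/to/leo2.xml
  $ export HWLOC_THISSYSTEM=1
  $ lstopo    # should now show the corrected L3/NUMA hierarchy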
Brice



On 27/10/2015 10:08, Fabian Wein wrote:
> Brice,
>
> thank you very much for the offer. I attached the xml file
> ..
>
> * hwloc 1.11.1 has encountered what looks like an error from the
> operating system.
> *
> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
> 0x003f) without inclusion!
> * Error occurred in topology.c line 981
> *
> ..
>
> So if you can afford the time, I appreciate it very much!
>
> Fabian
>
>
>
> On 10/27/2015 09:52 AM, Brice Goglin wrote:
>> Hello
>>
>> This bug is about L3 cache locality only, everything else should be
>> fine, including cache sizes. Few applications use that locality
>> information, so I assume it doesn't matter for PETSc scaling.
>> We can work around the bug by loading a XML topology. There's no easy
>> way to build that correct XML, but I can do it manually if you send your
>> current broken topology (lstopo foo.xml and send this foo.xml).
>>
>> Brice
>>
>>
>>
>> On 27/10/2015 09:43, Fabian Wein wrote:
>>> Hello,
>>>
>>> I'm new to the list and new to the mpi-business, too.
>>>
>>> Our 4*12 Opteron 6238 system is very similar to the one from the
>>> original poster and I get the same error message.
>>> Any use in posting my logs?
>>>
>>> I compiled the latest hwloc, no change. Our system is Ubuntu 14.04 LTS
>>> with kernel 3.13, and our BIOS is not updated.
>>>
>>> The system scales very fine with OpenMP but fails to give any
>>> realistic scaling using PETSc (both for the standard
>>> streaming benchmark and quick tests with a given application).
>>>
>>> As far as I understand the system is fine, just the information
>>> gathering fails, right?!
>>>
>>> Do you know if the hwloc issue relates with our poor PETSc scaling? Is
>>> there a way to configure the topology
>>> manually?
>>>
>>> To me it appears that a BIOS update wouldn't help, right?! I wouldn't
>>> try it if it is not necessary. I'm a user with sudo access,
>>> not an administrator, but we have no admin for the system.
>>>
>>> Thanks,
>>>
>>> Fabian

[Attachment: the corrected topology XML sent by Brice (loaded as leo2.xml above); its markup was stripped by the list archive.]

Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-27 Thread Brice Goglin
Hello
Good to know. Did you see/test the kernel patch yet? If possible, could
you send a link to the kernel commit when it appears upstream?
Thanks
Brice


On 27/10/2015 09:21, Ondřej Vlček wrote:
> Dear Brice,
>   thank you for your answer. Neither an upgrade of the BIOS nor using the latest
> hwloc helped. Finally we contacted AMD and they fixed a bug in the kernel which
> caused problems with 12-core AMD processors. They should upstream the changes
> to kernel.org soon, so that all the distros (CentOS, RHEL, SUSE, etc.) can pick
> them up automatically as they create their respective next releases.
>
> Ondrej
>
>> On Monday, August 24, 2015 15:32:12 Brice Goglin wrote:
>> Hello,
>>
>> hwloc 1.7 is very old, I am surprised CentOS 7 doesn't have anything
>> more recent, maybe not in "standard" packages?
>>
>> Anyway, this is a very common error on AMD 6200 and 6300 machines.
>> See
>> http://www.open-mpi.org/projects/hwloc/doc/v1.11.0/a00030.php#faq_os_error
>> Assuming your kernel isn't too old (CentOS7 should be fine), you should
>> try to upgrade the BIOS.
>>
>> Brice
>>
>> On 24/08/2015 15:06, Ondřej Vlček wrote:
>>> Dear all,
>>>
>>>   I have encountered an hwloc error for the AMD Opteron 6300 processor family
>>> (see below). I am using hwloc.x86_64 v1.7-3.el7, which is the latest
>>> version available in standard packages for CentOS 7. Is this something
>>> that has already been encountered and fixed in newer versions of hwloc?
>>> Output from the hwloc-gather-topology.sh script is attached.
>>>
>>> Thank you.
>>> Ondrej Vlcek
>>>
>>> $ hwloc-info
>>> ****************************************************************
>>> * Hwloc has encountered what looks like an error from the operating
>>> system.
>>> *
>>> * object (L3 cpuset 0x03f0) intersection without inclusion!
>>> * Error occurred in topology.c line 753
>>> *
>>> * Please report this error message to the hwloc user's mailing list,
>>> * along with the output from the hwloc-gather-topology.sh script.
>>> ****************************************************************
>>> depth 0:        1 Machine (type #1)
>>>  depth 1:       4 Socket (type #3)
>>>   depth 2:      8 NUMANode (type #2)
>>>    depth 3:     8 L3Cache (type #4)
>>>     depth 4:   24 L2Cache (type #4)
>>>      depth 5:  24 L1iCache (type #4)
>>>       depth 6: 48 L1dCache (type #4)
>>>        depth 7: 48 Core (type #5)
>>>         depth 8: 48 PU (type #6)
>>>
>>> Special depth -3:   4 Bridge (type #9)
>>> Special depth -4:   6 PCI Device (type #10)
>>> Special depth -5:   9 OS Device (type #11)
>>>
>>>



Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-27 Thread Ondřej Vlček
Dear Brice,
  thank you for your answer. Neither an upgrade of the BIOS nor using the latest
hwloc helped. Finally we contacted AMD and they fixed a bug in the kernel which
caused problems with 12-core AMD processors. They should upstream the changes
to kernel.org soon, so that all the distros (CentOS, RHEL, SUSE, etc.) can pick
them up automatically as they create their respective next releases.

Ondrej

> On Monday, August 24, 2015 15:32:12 Brice Goglin wrote:
> Hello,
> 
> hwloc 1.7 is very old, I am surprised CentOS 7 doesn't have anything
> more recent, maybe not in "standard" packages?
> 
> Anyway, this is a very common error on AMD 6200 and 6300 machines.
> See
> http://www.open-mpi.org/projects/hwloc/doc/v1.11.0/a00030.php#faq_os_error
> Assuming your kernel isn't too old (CentOS7 should be fine), you should
> try to upgrade the BIOS.
> 
> Brice
> 
> On 24/08/2015 15:06, Ondřej Vlček wrote:
> > Dear all,
> > 
> >   I have encountered an hwloc error for the AMD Opteron 6300 processor family
> > (see below). I am using hwloc.x86_64 v1.7-3.el7, which is the latest
> > version available in standard packages for CentOS 7. Is this something
> > that has already been encountered and fixed in newer versions of hwloc?
> > Output from the hwloc-gather-topology.sh script is attached.
> > 
> > Thank you.
> > Ondrej Vlcek
> > 
> > $ hwloc-info
> > ****************************************************************
> > * Hwloc has encountered what looks like an error from the operating
> > system.
> > *
> > * object (L3 cpuset 0x03f0) intersection without inclusion!
> > * Error occurred in topology.c line 753
> > *
> > * Please report this error message to the hwloc user's mailing list,
> > * along with the output from the hwloc-gather-topology.sh script.
> > ****************************************************************
> > depth 0:        1 Machine (type #1)
> >  depth 1:       4 Socket (type #3)
> >   depth 2:      8 NUMANode (type #2)
> >    depth 3:     8 L3Cache (type #4)
> >     depth 4:   24 L2Cache (type #4)
> >      depth 5:  24 L1iCache (type #4)
> >       depth 6: 48 L1dCache (type #4)
> >        depth 7: 48 Core (type #5)
> >         depth 8: 48 PU (type #6)
> > 
> > Special depth -3:   4 Bridge (type #9)
> > Special depth -4:   6 PCI Device (type #10)
> > Special depth -5:   9 OS Device (type #11)
> > 
> > 



Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

2015-10-27 Thread vlcek
Dear Brice,
  thank you for your answer. Neither an upgrade of the BIOS nor using the latest
hwloc helped. Finally we contacted AMD and they fixed a bug in the kernel which
caused problems with 12-core AMD processors. They should upstream the
changes to kernel.org soon, so that all the distros (CentOS, RHEL, SUSE, etc.)
can pick them up automatically as they create their respective next
releases.

Ondrej



-- Original message --
From: Brice Goglin <brice.gog...@inria.fr>
To: Ondrej Certik <ond...@certik.cz>
Date: 24. 8. 2015 15:32:33
Subject: Re: [hwloc-users] hwloc error for AMD Opteron 6300 processor family

Hello,

hwloc 1.7 is very old, I am surprised CentOS 7 doesn't have anything more 
recent, maybe not in "standard" packages?

Anyway, this is a very common error on AMD 6200 and 6300 machines.
See http://www.open-mpi.org/projects/hwloc/doc/v1.11.0/a00030.php#faq_os_error
Assuming your kernel isn't too old (CentOS7 should be fine), you should try
to upgrade the BIOS.

Brice


On 24/08/2015 15:06, Ondřej Vlček wrote:
Dear all,
  I have encountered an hwloc error for the AMD Opteron 6300 processor family
(see below). I am using hwloc.x86_64 v1.7-3.el7, which is the latest version
available in standard packages for CentOS 7. Is this something that has
already been encountered and fixed in newer versions of hwloc? Output from the
hwloc-gather-topology.sh script is attached.

Thank you.
Ondrej Vlcek

$ hwloc-info
****************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* object (L3 cpuset 0x03f0) intersection without inclusion!
* Error occurred in topology.c line 753
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************
depth 0:        1 Machine (type #1)
 depth 1:       4 Socket (type #3)
  depth 2:      8 NUMANode (type #2)
   depth 3:     8 L3Cache (type #4)
    depth 4:   24 L2Cache (type #4)
     depth 5:  24 L1iCache (type #4)
      depth 6: 48 L1dCache (type #4)
       depth 7: 48 Core (type #5)
        depth 8: 48 PU (type #6)
Special depth -3:   4 Bridge (type #9)
Special depth -4:   6 PCI Device (type #10)
Special depth -5:   9 OS Device (type #11)




Re: [hwloc-users] hwloc error with "node interleaving" disabled

2014-09-05 Thread Brice Goglin
Don't be sorry, I used "yet another" to complain about all these buggy AMD 
platforms, and not to complain about their owners ;)

Bug reports are always welcome, that's why the big warning says you should 
report it.

Also these warnings vary a little bit with the platform and processor model so 
it's hard to recognize them without training ;)

That said, I may add a FAQ entry about it.

Brice

On 5 September 2014 18:43:44 UTC+02:00, Jean-Pierre Adam
<jean_pierre_a...@hotmail.com> wrote:
>Silly me! I've just seen that Andrej reported exactly the same bug
>last month. I checked his .output file and it seems he has the same
>hardware as me. I see now why you said "yet another buggy AMD
>platform"!
>
>Sorry guys.
>
>
>Date: Fri, 5 Sep 2014 13:46:25 +0200
>From: brice.gog...@inria.fr
>To: hwloc-us...@open-mpi.org
>Subject: Re: [hwloc-users] hwloc error with "node interleaving"
>disabled
>
>Hello
>
>You sent the test.output file instead of test.tar.bz2 so I can't
>check for sure. Anyway I guess this is yet another buggy AMD
>platform with magny-cours/interlagos/abu-dhabi Opterons (61xx,
>62xx or 63xx).
>
>Sometimes upgrading the BIOS/kernel helps. Sometimes not.
>
>Some L3 caches will be missing in the hwloc topology because of
>this bug; it's likely not important for the vast majority of HPC
>libraries.
>
>You may hide the warning by setting HWLOC_HIDE_ERRORS=1 in your
>environment.
>
>Brice
>
>On 05/09/2014 12:06, Jean-Pierre Adam wrote:
>>Hello hwloc experts
>>
>>I encounter this bug when I'm using mpirun or hwloc directly:
>>
>>* hwloc has encountered what looks like an error from the
>>operating system.
>>*
>>* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
>>0x003f) without inclusion!
>>* Error occurred in topology.c line 940
>>*
>>* Please report this error message to the hwloc user's mailing
>>list,
>>* along with the output from the hwloc-gather-topology script.
>>
>>The output of hwloc-gather-topology is attached. The OS is
>>CentOS 7.
>>
>>The tool launched with mpirun runs as expected, still the
>>message is a bit worrying...
>>
>>I was able to avoid this message by enabling "node interleaving"
>>in the BIOS (which basically disables NUMA). In my case, I got a 5%
>>performance loss with that setting. It could be acceptable, but
>>I would like to understand what is going on.
>>
>>So is my motherboard / BIOS / OS buggy?
>>
>>Best regards


Re: [hwloc-users] hwloc error with "node interleaving" disabled

2014-09-05 Thread Jean-Pierre Adam
Silly me! I've just seen that Andrej reported exactly the same bug last month.
I checked his .output file and it seems he has the same hardware as me. I see
now why you said "yet another buggy AMD platform"!

Sorry guys.


List-Post: hwloc-users@lists.open-mpi.org
Date: Fri, 5 Sep 2014 13:46:25 +0200
From: brice.gog...@inria.fr
To: hwloc-us...@open-mpi.org
Subject: Re: [hwloc-users] hwloc error with "node interleaving" disabled

Hello

You sent the test.output file instead of test.tar.bz2 so I can't
check for sure. Anyway I guess this is yet another buggy AMD
platform with magny-cours/interlagos/abu-dhabi Opterons (61xx,
62xx or 63xx).

Sometimes upgrading the BIOS/kernel helps. Sometimes not.

Some L3 caches will be missing in the hwloc topology because of
this bug; it's likely not important for the vast majority of HPC
libraries.

You may hide the warning by setting HWLOC_HIDE_ERRORS=1 in your
environment.

Brice

On 05/09/2014 12:06, Jean-Pierre Adam wrote:

Hello hwloc experts

I encounter this bug when I'm using mpirun or hwloc directly:

* hwloc has encountered what looks like an error from the
operating system.
*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
0x003f) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing
list,
* along with the output from the hwloc-gather-topology script.

The output of hwloc-gather-topology is attached. The OS is
CentOS 7.

The tool launched with mpirun runs as expected, still the
message is a bit worrying...

I was able to avoid this message by enabling "node interleaving"
in the BIOS (which basically disables NUMA). In my case, I got a 5%
performance loss with that setting. It could be acceptable, but
I would like to understand what is going on.

So is my motherboard / BIOS / OS buggy?

Best regards

Re: [hwloc-users] hwloc error with "node interleaving" disabled

2014-09-05 Thread Jean-Pierre Adam
Oh, sorry, I forgot the test.tar.bz2. Here it is.

Indeed, it's an AMD platform with 6344 Opterons. Sorry, I didn't know it was a
known bug.

I think I have the latest BIOS and kernel.

Thanks for the tip to hide the warning!



List-Post: hwloc-users@lists.open-mpi.org
Date: Fri, 5 Sep 2014 13:46:25 +0200
From: brice.gog...@inria.fr
To: hwloc-us...@open-mpi.org
Subject: Re: [hwloc-users] hwloc error with "node interleaving" disabled

Hello

You sent the test.output file instead of test.tar.bz2 so I can't
check for sure. Anyway I guess this is yet another buggy AMD
platform with magny-cours/interlagos/abu-dhabi Opterons (61xx,
62xx or 63xx).

Sometimes upgrading the BIOS/kernel helps. Sometimes not.

Some L3 caches will be missing in the hwloc topology because of
this bug; it's likely not important for the vast majority of HPC
libraries.

You may hide the warning by setting HWLOC_HIDE_ERRORS=1 in your
environment.

Brice

On 05/09/2014 12:06, Jean-Pierre Adam wrote:

Hello hwloc experts

I encounter this bug when I'm using mpirun or hwloc directly:

* hwloc has encountered what looks like an error from the
operating system.
*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
0x003f) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing
list,
* along with the output from the hwloc-gather-topology script.

The output of hwloc-gather-topology is attached. The OS is
CentOS 7.

The tool launched with mpirun runs as expected, still the
message is a bit worrying...

I was able to avoid this message by enabling "node interleaving"
in the BIOS (which basically disables NUMA). In my case, I got a 5%
performance loss with that setting. It could be acceptable, but
I would like to understand what is going on.

So is my motherboard / BIOS / OS buggy?

Best regards

test.tar.bz2
Description: BZip2 compressed data


Re: [hwloc-users] hwloc error with "node interleaving" disabled

2014-09-05 Thread Brice Goglin
Hello

You sent the test.output file instead of test.tar.bz2 so I can't check
for sure. Anyway I guess this is yet another buggy AMD platform with
magny-cours/interlagos/abu-dhabi Opterons (61xx, 62xx or 63xx).

Sometimes upgrading the BIOS/kernel helps. Sometimes not.

Some L3 caches will be missing in the hwloc topology because of this
bug, it's likely not important for the vast majority of HPC libraries.

You may hide the warning by setting HWLOC_HIDE_ERRORS=1 in your environment.
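For instance (the mpirun line is only illustrative):

  $ export HWLOC_HIDE_ERRORS=1
  $ mpirun -np 24 ./myapp   # warning suppressed; the L3 info is still missing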

Brice




On 05/09/2014 12:06, Jean-Pierre Adam wrote:
> Hello hwloc experts
>
> I encounter this bug when I'm using mpirun or hwloc directly:
>
> 
> * hwloc has encountered what looks like an error from the operating
> system.
> *
> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset
> 0x003f) without inclusion!
> * Error occurred in topology.c line 940
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output from the hwloc-gather-topology script.
> 
>
> The output of hwloc-gather-topology is attached. The OS is Centos 7.
>
> The tool launched with mpirun runs as expected, still the message is a
> bit worrying...
>
> I was able to avoid this message by enabling "node interleaving" in
> the BIOS (which basically disables NUMA). In my case, I got a 5% performance
> loss with that setting. It could be acceptable, but I would like to
> understand what is going on.
>
> So is my motherboard / BIOS / OS buggy ?
>
> Best regards
>
>



Re: [hwloc-users] hwloc error

2014-08-16 Thread Andrej Prsa
Hi Brice,

> Your kernel looks recent enough, can you try upgrading your BIOS ? You
> have version 3.0b and there's a 3.5 version at
> http://www.supermicro.com/aplus/motherboard/opteron6000/sr56x0/h8qg6-f.cfm

For completeness, I just tried updating the BIOS to 3.5; hwloc still throws
the same error. The new files are attached. I guess the BIOS is still
buggy... Any other ideas?

Thanks,
Andrej


newbios.output
Description: Binary data


newbios.tar.bz2
Description: application/bzip