Re: [hwloc-users] Topology Error

2016-05-09 Thread Mehmet Belgin
Thank you, Brice, for your quick reply! We will give the BIOS upgrade a try
and share our findings with the list.


-Mehmet


On 5/9/16 6:10 PM, Brice Goglin wrote:

On 09/05/2016 23:58, Mehmet Belgin wrote:

Greetings!

We've been receiving this error for a while on our 64-core Interlagos
AMD machines:



* hwloc has encountered what looks like an error from the operating system.
*
* Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3 cpuset 0xff00,0xff00) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.
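
For readers unfamiliar with the wording, "intersects without inclusion" means that the two CPU sets share some CPUs but neither one contains the other, so the objects cannot be nested in a tree. A minimal sketch of that test on plain integer bitmasks follows. The function name and the mask values are hypothetical, chosen only for illustration; they are not the cpusets from the report above and this is not hwloc's own code.

```python
def relation(a: int, b: int) -> str:
    """Classify how two CPU bitmasks relate to each other."""
    common = a & b
    if common == 0:
        return "disjoint"
    if a == b:
        return "equal"
    if common == a:
        return "first inside second"
    if common == b:
        return "second inside first"
    # Some CPUs are shared, but each mask also has CPUs the other lacks.
    return "intersects without inclusion"


# Hypothetical masks (assumption): a socket with CPUs 0-15 and a
# NUMA node with CPUs 8-23.
socket_mask = 0x0000FFFF
node_mask = 0x00FFFF00
print(relation(socket_mask, node_mask))
```

A topology library can only place one object inside another when one of the first four cases holds, which is why the last case is reported as an error.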



I've found some information in the hwloc list archives saying that this
is due to a buggy AMD platform and that the impact should be limited to
hwloc missing L3 cache info (thanks Brice). If that's the case and the
processor representation is correct, then I am sure we can live with
this, but I still wanted to check with the list to confirm (1) whether
this is really harmless and (2) whether there are any known solutions
other than upgrading the BIOS/kernel.

Hello

The L3 bug only applies to 12-core Opteron 62xx/63xx, while you have
16-core Opterons. Your L3 locality is correct, but your NUMA locality is
wrong:
$ cat sys/devices/system/node/node*/cpumap
,00ff
ff00,ff00
00ff,
,
You should have something like this instead:
,
,
,
,
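
As a rough aid for checking cpumap files like the ones above, here is a minimal sketch that decodes a cpumap string and reports which sockets its CPUs fall on. It assumes the usual sysfs layout (comma-separated 32-bit hex words, most significant word first) and assumes CPUs are numbered consecutively with 16 per socket. The helper names and the sample strings are hypothetical and are not the values from this machine.

```python
def parse_cpumap(text):
    """Decode a sysfs cpumap string into a set of CPU numbers."""
    mask = 0
    for word in text.strip().split(","):
        # Each word holds 32 bits; earlier words are more significant.
        mask = (mask << 32) | int(word, 16)
    return {cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1}


def sockets_touched(cpus, cpus_per_socket=16):
    """List the sockets a CPU set lands on (assumes consecutive numbering)."""
    return sorted({cpu // cpus_per_socket for cpu in cpus})


# Illustrative strings only (assumption), not taken from the report.
examples = {
    "node_a": "00000000,0000ffff",  # CPUs 0-15
    "node_b": "0000ff00,0000ff00",  # CPUs 8-15 and 40-47
}
for name, text in examples.items():
    cpus = parse_cpumap(text)
    print(name, len(cpus), "CPUs on sockets", sockets_touched(cpus))
```

A node whose CPUs land on more than one socket, like node_b in this sketch, is the kind of NUMA locality problem described here.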

This bug is not harmless since memory buffers have a good chance of
being physically allocated far away from your cores.

This is more likely a BIOS bug. Try upgrading.

Regards
Brice

___
hwloc-users mailing list
hwloc-us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
Link to this post: 
http://www.open-mpi.org/community/lists/hwloc-users/2016/05/1274.php


--
=
Mehmet Belgin, Ph.D. (mehmet.bel...@oit.gatech.edu)
Scientific Computing Consultant | OIT - Academic and Research Technologies
Georgia Institute of Technology
258 4th Str NW, Rich Building, Room 326
Atlanta, GA  30332-0700
Office: (404) 385-0665



Re: [hwloc-users] Tolopology Error

2016-05-09 Thread Mehmet Belgin

Sorry for the typo in the subject, I meant "Topology" ;)

On 5/9/16 5:58 PM, Mehmet Belgin wrote:

Greetings!

We've been receiving this error for a while on our 64-core Interlagos 
AMD machines:


 

* hwloc has encountered what looks like an error from the operating system.
*
* Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3 cpuset 0xff00,0xff00) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.
 



I've found some information in the hwloc list archives saying that this
is due to a buggy AMD platform and that the impact should be limited to
hwloc missing L3 cache info (thanks Brice). If that's the case and the
processor representation is correct, then I am sure we can live with
this, but I still wanted to check with the list to confirm (1) whether
this is really harmless and (2) whether there are any known solutions
other than upgrading the BIOS/kernel.


The hwloc-gather-topology output is also attached.

Our schedulers (Torque/Moab) and MPI stacks rely heavily on hwloc, and I
need to ensure that this is not a critical issue, so any suggestions
will help.


Thank you!
-Mehmet




___
hwloc-users mailing list
hwloc-us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
Link to this post: 
http://www.open-mpi.org/community/lists/hwloc-users/2016/05/1272.php





[hwloc-users] Tolopology Error

2016-05-09 Thread Mehmet Belgin

Greetings!

We've been receiving this error for a while on our 64-core Interlagos 
AMD machines:



* hwloc has encountered what looks like an error from the operating system.
*
* Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3 cpuset 0xff00,0xff00) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.



I've found some information in the hwloc list archives saying that this
is due to a buggy AMD platform and that the impact should be limited to
hwloc missing L3 cache info (thanks Brice). If that's the case and the
processor representation is correct, then I am sure we can live with
this, but I still wanted to check with the list to confirm (1) whether
this is really harmless and (2) whether there are any known solutions
other than upgrading the BIOS/kernel.


The hwloc-gather-topology output is also attached.

Our schedulers (Torque/Moab) and MPI stacks rely heavily on hwloc, and I
need to ensure that this is not a critical issue, so any suggestions
will help.


Thank you!
-Mehmet


Machine (P#0 total=134199212KB DMIProductName="Altus 1804i" DMIProductVersion=" " DMIProductSerial=P1724391 DMIProductUUID=1AA86536-A5AE-E211-9A76-EE604CC27BCE DMIBoardVendor=Supermicro DMIBoardName=H8QG6 DMIBoardVersion=1234567890 DMIBoardSerial=WM2BS70921 DMIBoardAssetTag=" " DMIChassisVendor=Supermicro DMIChassisType=17 DMIChassisVersion=1234567890 DMIChassisSerial=1234567890. DMIChassisAssetTag=" " DMIBIOSVendor="American Megatrends Inc." DMIBIOSVersion="3.0b  " DMIBIOSDate=02/01/2013 DMISysVendor="Penguin Computing" Backend=Linux LinuxCgroup=/)
  Group0 L#0 (total=67106732KB)
    NUMANode L#0 (P#1 local=33552300KB total=33552300KB)
      Socket L#0 (P#0 CPUModel="AMD Opteron(tm) Processor 6378 ")
        L3Cache L#0 (size=6144KB linesize=64 ways=64)
          L2Cache L#0 (size=2048KB linesize=64 ways=16)
            L1iCache L#0 (size=64KB linesize=64 ways=2)
              L1dCache L#0 (size=16KB linesize=64 ways=4)
                Core L#0 (P#0)
                  PU L#0 (P#0)
              L1dCache L#1 (size=16KB linesize=64 ways=4)
                Core L#1 (P#1)
                  PU L#1 (P#1)
          L2Cache L#1 (size=2048KB linesize=64 ways=16)
            L1iCache L#1 (size=64KB linesize=64 ways=2)
              L1dCache L#2 (size=16KB linesize=64 ways=4)
                Core L#2 (P#2)
                  PU L#2 (P#2)
              L1dCache L#3 (size=16KB linesize=64 ways=4)
                Core L#3 (P#3)
                  PU L#3 (P#3)
          L2Cache L#2 (size=2048KB linesize=64 ways=16)
            L1iCache L#2 (size=64KB linesize=64 ways=2)
              L1dCache L#4 (size=16KB linesize=64 ways=4)
                Core L#4 (P#4)
                  PU L#4 (P#4)
              L1dCache L#5 (size=16KB linesize=64 ways=4)
                Core L#5 (P#5)
                  PU L#5 (P#5)
          L2Cache L#3 (size=2048KB linesize=64 ways=16)
            L1iCache L#3 (size=64KB linesize=64 ways=2)
              L1dCache L#6 (size=16KB linesize=64 ways=4)
                Core L#6 (P#6)
                  PU L#6 (P#6)
              L1dCache L#7 (size=16KB linesize=64 ways=4)
                Core L#7 (P#7)
                  PU L#7 (P#7)
        L3Cache L#1 (size=6144KB linesize=64 ways=64)
          L2Cache L#4 (size=2048KB linesize=64 ways=16)
            L1iCache L#4 (size=64KB linesize=64 ways=2)
              L1dCache L#8 (size=16KB linesize=64 ways=4)
                Core L#8 (P#0)
                  PU L#8 (P#8)
              L1dCache L#9 (size=16KB linesize=64 ways=4)
                Core L#9 (P#1)
                  PU L#9 (P#9)
          L2Cache L#5 (size=2048KB linesize=64 ways=16)
            L1iCache L#5 (size=64KB linesize=64 ways=2)
              L1dCache L#10 (size=16KB linesize=64 ways=4)
                Core L#10 (P#2)
                  PU L#10 (P#10)
              L1dCache L#11 (size=16KB linesize=64 ways=4)
                Core L#11 (P#3)
                  PU L#11 (P#11)
          L2Cache L#6 (size=2048KB linesize=64 ways=16)
            L1iCache L#6 (size=64KB linesize=64 ways=2)
              L1dCache L#12 (size=16KB linesize=64 ways=4)
                Core L#12 (P#4)
                  PU L#12 (P#12)
              L1dCache L#13 (size=16KB linesize=64 ways=4)
                Core L#13 (P#5)
                  PU L#13 (P#13)
          L2Cache L#7 (size=2048KB linesize=64 ways=16)
            L1iCache L#7 (size=64KB linesize=64 ways=2)
              L1dCache L#14 (size=16KB linesize=64 ways=4)
                Core L#14 (P#6)
                  PU L#14 (P#14)
              L1dCache L#15 (size=16KB linesize=64 ways=4)
                Core L#15 (P#7)
                  PU L#15