Re: [hwloc-users] Topology Error
Thank you Brice for your quick reply! We will give the BIOS upgrade a try and share our findings with the list.

-Mehmet

On 5/9/16 6:10 PM, Brice Goglin wrote:
> On 09/05/2016 at 23:58, Mehmet Belgin wrote:
>> Greetings!
>>
>> We've been receiving this error for a while on our 64-core Interlagos
>> AMD machines:
>>
>> * hwloc has encountered what looks like an error from the operating system.
>> *
>> * Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3 cpuset 0xff00,0xff00) without inclusion!
>> * Error occurred in topology.c line 940
>> *
>> * Please report this error message to the hwloc user's mailing list,
>> * along with the output+tarball generated by the hwloc-gather-topology script.
>>
>> I've found some information in the hwloc list archives mentioning that
>> this is due to a buggy AMD platform and that the impact should be limited
>> to hwloc missing L3 cache info (thanks Brice). If that's the case and the
>> processor representation is correct, then I am sure we can live with this,
>> but I still wanted to check with the list to confirm (1) that this is
>> really harmless, and (2) whether there are any known solutions other than
>> upgrading the BIOS/kernel.
>
> Hello
>
> The L3 bug only applies to 12-core Opteron 62xx/63xx, while you have
> 16-core Opterons. Your L3 locality is correct, but your NUMA locality
> is wrong:
>
> $ cat /sys/devices/system/node/node*/cpumap
> ,00ff
> ff00,ff00
> 00ff,
> ,
>
> You should have something like this instead:
> ,
> ,
> ,
> ,
>
> This bug is not harmless, since memory buffers have a good chance of
> being physically allocated far away from your cores. This is more
> likely a BIOS bug. Try upgrading.
>
> Regards
> Brice

___
hwloc-users mailing list
hwloc-us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2016/05/1274.php

--
Mehmet Belgin, Ph.D. (mehmet.bel...@oit.gatech.edu)
Scientific Computing Consultant | OIT - Academic and Research Technologies
Georgia Institute of Technology
258 4th Str NW, Rich Building, Room 326
Atlanta, GA 30332-0700
Office: (404) 385-0665
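[Editor's note: the exact masks in Brice's reply above were truncated by the archive. As a rough illustration of what sane per-node cpumaps would look like, here is a small sketch that computes the masks for a hypothetical but typical 64-core Interlagos layout (8 NUMA nodes of 8 consecutive cores each); the layout is an assumption, not taken from the attached tarball.]

```python
# Hypothetical layout: 64 cores split into 8 NUMA nodes of 8 consecutive
# cores each (typical for a 4-socket Opteron 6378 machine). On a healthy
# system, each sysfs cpumap would contain exactly one such mask.
CORES = 64
NODES = 8
CORES_PER_NODE = CORES // NODES  # 8

def expected_cpumask(node):
    """Bitmask of the CPUs that should belong to `node`."""
    return ((1 << CORES_PER_NODE) - 1) << (node * CORES_PER_NODE)

for node in range(NODES):
    # Format as two comma-separated 32-bit words, like the sysfs cpumap files.
    mask = expected_cpumask(node)
    print(f"node{node}: {mask >> 32:08x},{mask & 0xffffffff:08x}")
```

Note how each mask is disjoint from the others and covers a contiguous run of cores; the broken cpumaps above instead repeat and overlap ranges across nodes.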
Re: [hwloc-users] Tolopology Error
Sorry for the typo in the subject, I meant "Topology" ;)

On 5/9/16 5:58 PM, Mehmet Belgin wrote:
> Greetings!
>
> We've been receiving this error for a while on our 64-core Interlagos
> AMD machines:
>
> * hwloc has encountered what looks like an error from the operating system.
> *
> * Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3 cpuset 0xff00,0xff00) without inclusion!
> * Error occurred in topology.c line 940
> *
> * Please report this error message to the hwloc user's mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology script.
>
> I've found some information in the hwloc list archives mentioning that
> this is due to a buggy AMD platform and that the impact should be limited
> to hwloc missing L3 cache info (thanks Brice). If that's the case and the
> processor representation is correct, then I am sure we can live with this,
> but I still wanted to check with the list to confirm (1) that this is
> really harmless, and (2) whether there are any known solutions other than
> upgrading the BIOS/kernel.
>
> The hwloc-gather-topology output is also attached. Our schedulers
> (Torque/Moab) and MPI stacks rely heavily on hwloc, and I need to ensure
> that this is not a critical issue, so any suggestions will help.
>
> Thank you!
> -Mehmet

___
hwloc-users mailing list
hwloc-us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2016/05/1272.php
[hwloc-users] Tolopology Error
Greetings!

We've been receiving this error for a while on our 64-core Interlagos AMD machines:

* hwloc has encountered what looks like an error from the operating system.
*
* Socket (P#2 cpuset 0x,0x0) intersects with NUMANode (P#3 cpuset 0xff00,0xff00) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.

I've found some information in the hwloc list archives mentioning that this is due to a buggy AMD platform and that the impact should be limited to hwloc missing L3 cache info (thanks Brice). If that's the case and the processor representation is correct, then I am sure we can live with this, but I still wanted to check with the list to confirm (1) that this is really harmless, and (2) whether there are any known solutions other than upgrading the BIOS/kernel.

The hwloc-gather-topology output is also attached. Our schedulers (Torque/Moab) and MPI stacks rely heavily on hwloc, and I need to ensure that this is not a critical issue, so any suggestions will help.

Thank you!
-Mehmet

Machine (P#0 total=134199212KB DMIProductName="Altus 1804i" DMIProductVersion=" " DMIProductSerial=P1724391 DMIProductUUID=1AA86536-A5AE-E211-9A76-EE604CC27BCE DMIBoardVendor=Supermicro DMIBoardName=H8QG6 DMIBoardVersion=1234567890 DMIBoardSerial=WM2BS70921 DMIBoardAssetTag=" " DMIChassisVendor=Supermicro DMIChassisType=17 DMIChassisVersion=1234567890 DMIChassisSerial=1234567890. DMIChassisAssetTag=" " DMIBIOSVendor="American Megatrends Inc." DMIBIOSVersion="3.0b " DMIBIOSDate=02/01/2013 DMISysVendor="Penguin Computing" Backend=Linux LinuxCgroup=/)
  Group0 L#0 (total=67106732KB)
    NUMANode L#0 (P#1 local=33552300KB total=33552300KB)
      Socket L#0 (P#0 CPUModel="AMD Opteron(tm) Processor 6378 ")
        L3Cache L#0 (size=6144KB linesize=64 ways=64)
          L2Cache L#0 (size=2048KB linesize=64 ways=16)
            L1iCache L#0 (size=64KB linesize=64 ways=2)
              L1dCache L#0 (size=16KB linesize=64 ways=4)
                Core L#0 (P#0)
                  PU L#0 (P#0)
              L1dCache L#1 (size=16KB linesize=64 ways=4)
                Core L#1 (P#1)
                  PU L#1 (P#1)
          L2Cache L#1 (size=2048KB linesize=64 ways=16)
            L1iCache L#1 (size=64KB linesize=64 ways=2)
              L1dCache L#2 (size=16KB linesize=64 ways=4)
                Core L#2 (P#2)
                  PU L#2 (P#2)
              L1dCache L#3 (size=16KB linesize=64 ways=4)
                Core L#3 (P#3)
                  PU L#3 (P#3)
          L2Cache L#2 (size=2048KB linesize=64 ways=16)
            L1iCache L#2 (size=64KB linesize=64 ways=2)
              L1dCache L#4 (size=16KB linesize=64 ways=4)
                Core L#4 (P#4)
                  PU L#4 (P#4)
              L1dCache L#5 (size=16KB linesize=64 ways=4)
                Core L#5 (P#5)
                  PU L#5 (P#5)
          L2Cache L#3 (size=2048KB linesize=64 ways=16)
            L1iCache L#3 (size=64KB linesize=64 ways=2)
              L1dCache L#6 (size=16KB linesize=64 ways=4)
                Core L#6 (P#6)
                  PU L#6 (P#6)
              L1dCache L#7 (size=16KB linesize=64 ways=4)
                Core L#7 (P#7)
                  PU L#7 (P#7)
        L3Cache L#1 (size=6144KB linesize=64 ways=64)
          L2Cache L#4 (size=2048KB linesize=64 ways=16)
            L1iCache L#4 (size=64KB linesize=64 ways=2)
              L1dCache L#8 (size=16KB linesize=64 ways=4)
                Core L#8 (P#0)
                  PU L#8 (P#8)
              L1dCache L#9 (size=16KB linesize=64 ways=4)
                Core L#9 (P#1)
                  PU L#9 (P#9)
          L2Cache L#5 (size=2048KB linesize=64 ways=16)
            L1iCache L#5 (size=64KB linesize=64 ways=2)
              L1dCache L#10 (size=16KB linesize=64 ways=4)
                Core L#10 (P#2)
                  PU L#10 (P#10)
              L1dCache L#11 (size=16KB linesize=64 ways=4)
                Core L#11 (P#3)
                  PU L#11 (P#11)
          L2Cache L#6 (size=2048KB linesize=64 ways=16)
            L1iCache L#6 (size=64KB linesize=64 ways=2)
              L1dCache L#12 (size=16KB linesize=64 ways=4)
                Core L#12 (P#4)
                  PU L#12 (P#12)
              L1dCache L#13 (size=16KB linesize=64 ways=4)
                Core L#13 (P#5)
                  PU L#13 (P#13)
          L2Cache L#7 (size=2048KB linesize=64 ways=16)
            L1iCache L#7 (size=64KB linesize=64 ways=2)
              L1dCache L#14 (size=16KB linesize=64 ways=4)
                Core L#14 (P#6)
                  PU L#14 (P#14)
              L1dCache L#15 (size=16KB linesize=64 ways=4)
                Core L#15 (P#7)
                  PU L#15
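[Editor's note: the "intersects ... without inclusion" message comes from a basic consistency rule: any two object cpusets in the topology tree must either be disjoint or one must contain the other. A minimal sketch of that check, with made-up masks chosen to mirror the broken case (not the actual masks from this machine, which were truncated in the archive):]

```python
# Sketch of the cpuset consistency rule behind hwloc's error message:
# two cpusets must be disjoint or nested; anything else is a broken topology.

def relation(a, b):
    """Classify how two cpusets (ints used as bitmasks) relate."""
    overlap = a & b
    if overlap == 0:
        return "disjoint"
    if overlap == a or overlap == b:
        return "included"
    return "intersects without inclusion"  # the case hwloc rejects

# Hypothetical masks: a Socket covering cores 32-47 and a NUMANode that a
# buggy BIOS reports as covering cores 40-47 and 56-63. They overlap on
# cores 40-47 but neither contains the other.
socket_cpuset = 0x0000ffff00000000
numanode_cpuset = 0xff00ff0000000000

print(relation(socket_cpuset, numanode_cpuset))
# → intersects without inclusion
```

With NUMA cpumaps like the ones in Brice's reply, sockets and NUMA nodes end up in this third state, which is why hwloc refuses to build a clean tree rather than silently mis-assigning memory locality.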