1) We are getting hwloc topology errors when programs start up on some new compute nodes recently added to our cluster ...
[roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel/bin/mpirun -np 2 --mca btl tcp,sm,self --host bro127,bro127 ./a.out
****************************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* object intersection without inclusion!
* Error occurred in topology.c line 594
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************
Number of processes = 2
Test repeated 3 times for reliability
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
I am process 1 on node bro127
P1: Waiting to receive from to P0
P0: Waiting to receive from P1
P0: Received from to P1
Run 2 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
Run 3 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P0: Received from to P1
P0: Done
P1: Sending to to P0
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Done

2) I've run hwloc-gather-topology.sh and attached bro127.tar.bz2 ...
[roberpj@bro127:~/samples/hwloc-gather-topology] /home/roberpj/builds/hwloc/1.7.2/1.7.2-debug/bin/hwloc-gather-topology $(uname -n)
Hierarchy gathered in ./bro127.tar.bz2 and kept in /tmp/tmp.Fr37QhvDGD/bro127/
****************************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* object (Socket P#0 cpuset 0x000000ff) intersection without inclusion!
* Error occurred in topology.c line 718
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************
Expected topology output stored in ./bro127.output
[roberpj@bro127:~/samples/hwloc-gather-topology] cat bro127.output
Machine (P#0 total=67106040KB DMIProductName=empty DMIProductVersion=empty DMIBoardVendor="TYAN Computer Corporation" DMIBoardName=YR190-B8238 DMIBoardVersion=empty DMIBoardAssetTag=empty DMIChassisVendor=empty DMIChassisType=3 DMIChassisVersion=empty DMIChassisAssetTag=empty DMIBIOSVendor="American Megatrends Inc." DMIBIOSVersion='V1.01.B10' DMIBIOSDate=09/26/2011 DMISysVendor=empty Backend=Linux LinuxCgroup=/)
  NUMANode L#0 (P#0 local=33551608KB total=33551608KB)
    L3Cache L#0 (size=6144KB linesize=64 ways=64)
      L2Cache L#0 (size=2048KB linesize=64 ways=16)
        L1iCache L#0 (size=64KB linesize=64 ways=2)
          L1dCache L#0 (size=16KB linesize=64 ways=4)
            Core L#0 (P#0)
              PU L#0 (P#0)
          L1dCache L#1 (size=16KB linesize=64 ways=4)
            Core L#1 (P#1)
              PU L#1 (P#1)
      L2Cache L#1 (size=2048KB linesize=64 ways=16)
        L1iCache L#1 (size=64KB linesize=64 ways=2)
          L1dCache L#2 (size=16KB linesize=64 ways=4)
            Core L#2 (P#2)
              PU L#2 (P#2)
          L1dCache L#3 (size=16KB linesize=64 ways=4)
            Core L#3 (P#3)
              PU L#3 (P#3)
    L3Cache L#1 (size=6144KB linesize=64 ways=64)
      L2Cache L#2 (size=2048KB linesize=64 ways=16)
        L1iCache L#2 (size=64KB linesize=64 ways=2)
          L1dCache L#4 (size=16KB linesize=64 ways=4)
            Core L#4 (P#0)
              PU L#4 (P#8)
          L1dCache L#5 (size=16KB linesize=64 ways=4)
            Core L#5 (P#1)
              PU L#5 (P#9)
      L2Cache L#3 (size=2048KB linesize=64 ways=16)
        L1iCache L#3 (size=64KB linesize=64 ways=2)
          L1dCache L#6 (size=16KB linesize=64 ways=4)
            Core L#6 (P#2)
              PU L#6 (P#10)
          L1dCache L#7 (size=16KB linesize=64 ways=4)
            Core L#7 (P#3)
              PU L#7 (P#11)
  NUMANode L#1 (P#1 local=33554432KB total=33554432KB)
    L3Cache L#2 (size=6144KB linesize=64 ways=64)
      L2Cache L#4 (size=2048KB linesize=64 ways=16)
        L1iCache L#4 (size=64KB linesize=64 ways=2)
          L1dCache L#8 (size=16KB linesize=64 ways=4)
            Core L#8 (P#0)
              PU L#8 (P#4)
          L1dCache L#9 (size=16KB linesize=64 ways=4)
            Core L#9 (P#1)
              PU L#9 (P#5)
      L2Cache L#5 (size=2048KB linesize=64 ways=16)
        L1iCache L#5 (size=64KB linesize=64 ways=2)
          L1dCache L#10 (size=16KB linesize=64 ways=4)
            Core L#10 (P#2)
              PU L#10 (P#6)
          L1dCache L#11 (size=16KB linesize=64 ways=4)
            Core L#11 (P#3)
              PU L#11 (P#7)
    L3Cache L#3 (size=6144KB linesize=64 ways=64)
      L2Cache L#6 (size=2048KB linesize=64 ways=16)
        L1iCache L#6 (size=64KB linesize=64 ways=2)
          L1dCache L#12 (size=16KB linesize=64 ways=4)
            Core L#12 (P#0)
              PU L#12 (P#12)
          L1dCache L#13 (size=16KB linesize=64 ways=4)
            Core L#13 (P#1)
              PU L#13 (P#13)
      L2Cache L#7 (size=2048KB linesize=64 ways=16)
        L1iCache L#7 (size=64KB linesize=64 ways=2)
          L1dCache L#14 (size=16KB linesize=64 ways=4)
            Core L#14 (P#2)
              PU L#14 (P#14)
          L1dCache L#15 (size=16KB linesize=64 ways=4)
            Core L#15 (P#3)
              PU L#15 (P#15)
depth 0: 1 Machine (type #1)
depth 1: 2 NUMANode (type #2)
depth 2: 4 L3Cache (type #4)
depth 3: 8 L2Cache (type #4)
depth 4: 8 L1iCache (type #4)
depth 5: 16 L1dCache (type #4)
depth 6: 16 Core (type #5)
depth 7: 16 PU (type #6)
latency matrix between NUMANodes (depth 1) by logical indexes:
  index     0     1
      0 1.000 1.600
      1 1.600 1.000
Topology not from this system

3) SRAT dmesg output was mentioned in another similar ticket, http://www.open-mpi.org/community/lists/hwloc-users/2012/05/0639.php, so I am including ours here as well ...
[roberpj@bro127:~] dmesg | grep SRAT
ACPI: SRAT 00000000dfdba570 001D0 (v02 AMD AGESA 00000001 AMD 00000001)
SRAT: PXM 0 -> APIC 32 -> Node 0
SRAT: PXM 0 -> APIC 33 -> Node 0
SRAT: PXM 0 -> APIC 34 -> Node 0
SRAT: PXM 0 -> APIC 35 -> Node 0
SRAT: PXM 1 -> APIC 36 -> Node 1
SRAT: PXM 1 -> APIC 37 -> Node 1
SRAT: PXM 1 -> APIC 38 -> Node 1
SRAT: PXM 1 -> APIC 39 -> Node 1
SRAT: PXM 2 -> APIC 64 -> Node 2
SRAT: PXM 2 -> APIC 65 -> Node 2
SRAT: PXM 2 -> APIC 66 -> Node 2
SRAT: PXM 2 -> APIC 67 -> Node 2
SRAT: PXM 3 -> APIC 68 -> Node 3
SRAT: PXM 3 -> APIC 69 -> Node 3
SRAT: PXM 3 -> APIC 70 -> Node 3
SRAT: PXM 3 -> APIC 71 -> Node 3
SRAT: Node 0 PXM 0 0-a0000
SRAT: Node 0 PXM 0 100000-e0000000
SRAT: Node 0 PXM 0 100000000-820000000
SRAT: Node 1 PXM 1 820000000-1020000000

4) Note the nodes have a 10GE interface on eth2 ...
[root@bro127:~] nano /var/log/messages
(snip)
Jan 15 16:03:55 bro127 kernel: ADDRCONF(NETDEV_UP): eth2: link is not ready
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: changing MTU from 1500 to 8000
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: detected SFP+: 3
Jan 15 16:03:55 bro127 kernel: SoftIWARP attached
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: detected SFP+: 3
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 15 16:03:55 bro127 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[roberpj@bro127:~] modinfo ixgbe
filename:       /lib/modules/2.6.32-279.5.2.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko
version:        3.6.7-k
license:        GPL
description:    Intel(R) 10 Gigabit PCI Express Network Driver
author:         Intel Corporation, <linux.n...@intel.com>
srcversion:     EC64C3345C7AC6AB4BD6F5C
alias:          pci:v00008086d0000154Asv*sd*bc*sc*i*
alias:          pci:v00008086d00001557sv*sd*bc*sc*i*
alias:          pci:v00008086d0000154Fsv*sd*bc*sc*i*
alias:          pci:v00008086d0000154Dsv*sd*bc*sc*i*
alias:          pci:v00008086d00001528sv*sd*bc*sc*i*
alias:          pci:v00008086d000010F8sv*sd*bc*sc*i*
alias:          pci:v00008086d0000151Csv*sd*bc*sc*i*
alias:          pci:v00008086d00001529sv*sd*bc*sc*i*
alias:          pci:v00008086d0000152Asv*sd*bc*sc*i*
alias:          pci:v00008086d000010F9sv*sd*bc*sc*i*
alias:          pci:v00008086d00001514sv*sd*bc*sc*i*
alias:          pci:v00008086d00001507sv*sd*bc*sc*i*
alias:          pci:v00008086d000010FBsv*sd*bc*sc*i*
alias:          pci:v00008086d00001517sv*sd*bc*sc*i*
alias:          pci:v00008086d000010FCsv*sd*bc*sc*i*
alias:          pci:v00008086d000010F7sv*sd*bc*sc*i*
alias:          pci:v00008086d00001508sv*sd*bc*sc*i*
alias:          pci:v00008086d000010DBsv*sd*bc*sc*i*
alias:          pci:v00008086d000010F4sv*sd*bc*sc*i*
alias:          pci:v00008086d000010E1sv*sd*bc*sc*i*
alias:          pci:v00008086d000010F1sv*sd*bc*sc*i*
alias:          pci:v00008086d000010ECsv*sd*bc*sc*i*
alias:          pci:v00008086d000010DDsv*sd*bc*sc*i*
alias:          pci:v00008086d0000150Bsv*sd*bc*sc*i*
alias:          pci:v00008086d000010C8sv*sd*bc*sc*i*
alias:          pci:v00008086d000010C7sv*sd*bc*sc*i*
alias:          pci:v00008086d000010C6sv*sd*bc*sc*i*
alias:          pci:v00008086d000010B6sv*sd*bc*sc*i*
depends:        mdio,dca
vermagic:       2.6.32-279.5.2.el6.x86_64 SMP mod_unload modversions
parm:           IntMode:Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2 (array of int)
parm:           FdirMode:Flow Director filtering modes (0=Off, 1=Hashing) default 1 (array of int)
parm:           max_vfs:Maximum number of virtual functions to allocate per physical function (uint)
parm:           allow_unsupported_sfp:Allow unsupported and untested SFP+ modules on 82599-based adapters (uint)
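
For reference, a minimal Python sketch of what the "intersection without inclusion" message in 2) appears to describe, assuming it refers to the Socket P#0 cpuset (0x000000ff) partially overlapping the NUMA nodes. The NUMA node masks are worked out by hand from the PU P# numbers listed in bro127.output; mask() and relation() are made-up helper names for illustration, not hwloc API, and this is not how hwloc itself is implemented.

```python
def mask(pus):
    """Build a cpuset bitmask from a list of physical PU numbers."""
    m = 0
    for p in pus:
        m |= 1 << p
    return m


def relation(a, b):
    """Classify how two cpuset bitmasks relate to each other."""
    inter = a & b
    if inter == 0:
        return "disjoint"
    if a == b:
        return "equal"
    if inter == b:
        return "first includes second"
    if inter == a:
        return "second includes first"
    return "intersection without inclusion"


# Cpuset reported in the hwloc error message for Socket P#0.
socket0 = 0x000000FF

# PU P# numbers listed under each NUMANode in bro127.output.
numa0 = mask([0, 1, 2, 3, 8, 9, 10, 11])    # 0x0f0f
numa1 = mask([4, 5, 6, 7, 12, 13, 14, 15])  # 0xf0f0

for name, node in (("NUMANode P#0", numa0), ("NUMANode P#1", numa1)):
    print(f"Socket P#0 {socket0:#06x} vs {name} {node:#06x}: "
          f"{relation(socket0, node)}")
```

Under that assumption, the socket mask shares PUs 0-3 with NUMANode P#0 and PUs 4-7 with NUMANode P#1 but neither set contains the other, which would be consistent with no Socket level appearing in the topology output.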
bro127.tar.bz2
Description: BZip2 compressed data