1) We are getting hwloc topology errors when programs start up on
some new compute nodes recently added to our cluster ...

[roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel/bin/mpirun -np 2 --mca btl tcp,sm,self --host bro127,bro127 ./a.out
****************************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* object intersection without inclusion!
* Error occurred in topology.c line 594
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************
Number of processes = 2
Test repeated 3 times for reliability
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
I am process 1 on node bro127
P1: Waiting to receive from to P0
P0:  Waiting to receive from P1
P0:  Received from to P1
Run 2 of 3
P0:  Sending to P1
P0:  Waiting to receive from P1
P0:  Received from to P1
Run 3 of 3
P0:  Sending to P1
P0:  Waiting to receive from P1
P0:  Received from to P1
P0:  Done
P1:  Sending to to P0
P1:  Waiting to receive from to P0
P1:  Sending to to P0
P1:  Waiting to receive from to P0
P1:  Sending to to P0
P1:  Done

2) I've run hwloc-gather-topology.sh and attached bro127.tar.bz2 ...

[roberpj@bro127:~/samples/hwloc-gather-topology] /home/roberpj/builds/hwloc/1.7.2/1.7.2-debug/bin/hwloc-gather-topology $(uname -n)
Hierarchy gathered in ./bro127.tar.bz2 and kept in /tmp/tmp.Fr37QhvDGD/bro127/
****************************************************************************
* Hwloc has encountered what looks like an error from the operating system.
*
* object (Socket P#0 cpuset 0x000000ff) intersection without inclusion!
* Error occurred in topology.c line 718
*
* Please report this error message to the hwloc user's mailing list,
* along with the output from the hwloc-gather-topology.sh script.
****************************************************************************
Expected topology output stored in ./bro127.output

[roberpj@bro127:~/samples/hwloc-gather-topology] cat bro127.output
Machine (P#0 total=67106040KB DMIProductName=empty DMIProductVersion=empty DMIBoardVendor="TYAN Computer Corporation" DMIBoardName=YR190-B8238 DMIBoardVersion=empty DMIBoardAssetTag=empty DMIChassisVendor=empty DMIChassisType=3 DMIChassisVersion=empty DMIChassisAssetTag=empty DMIBIOSVendor="American Megatrends Inc." DMIBIOSVersion='V1.01.B10' DMIBIOSDate=09/26/2011 DMISysVendor=empty Backend=Linux LinuxCgroup=/)
   NUMANode L#0 (P#0 local=33551608KB total=33551608KB)
     L3Cache L#0 (size=6144KB linesize=64 ways=64)
       L2Cache L#0 (size=2048KB linesize=64 ways=16)
         L1iCache L#0 (size=64KB linesize=64 ways=2)
           L1dCache L#0 (size=16KB linesize=64 ways=4)
             Core L#0 (P#0)
               PU L#0 (P#0)
           L1dCache L#1 (size=16KB linesize=64 ways=4)
             Core L#1 (P#1)
               PU L#1 (P#1)
       L2Cache L#1 (size=2048KB linesize=64 ways=16)
         L1iCache L#1 (size=64KB linesize=64 ways=2)
           L1dCache L#2 (size=16KB linesize=64 ways=4)
             Core L#2 (P#2)
               PU L#2 (P#2)
           L1dCache L#3 (size=16KB linesize=64 ways=4)
             Core L#3 (P#3)
               PU L#3 (P#3)
     L3Cache L#1 (size=6144KB linesize=64 ways=64)
       L2Cache L#2 (size=2048KB linesize=64 ways=16)
         L1iCache L#2 (size=64KB linesize=64 ways=2)
           L1dCache L#4 (size=16KB linesize=64 ways=4)
             Core L#4 (P#0)
               PU L#4 (P#8)
           L1dCache L#5 (size=16KB linesize=64 ways=4)
             Core L#5 (P#1)
               PU L#5 (P#9)
       L2Cache L#3 (size=2048KB linesize=64 ways=16)
         L1iCache L#3 (size=64KB linesize=64 ways=2)
           L1dCache L#6 (size=16KB linesize=64 ways=4)
             Core L#6 (P#2)
               PU L#6 (P#10)
           L1dCache L#7 (size=16KB linesize=64 ways=4)
             Core L#7 (P#3)
               PU L#7 (P#11)
   NUMANode L#1 (P#1 local=33554432KB total=33554432KB)
     L3Cache L#2 (size=6144KB linesize=64 ways=64)
       L2Cache L#4 (size=2048KB linesize=64 ways=16)
         L1iCache L#4 (size=64KB linesize=64 ways=2)
           L1dCache L#8 (size=16KB linesize=64 ways=4)
             Core L#8 (P#0)
               PU L#8 (P#4)
           L1dCache L#9 (size=16KB linesize=64 ways=4)
             Core L#9 (P#1)
               PU L#9 (P#5)
       L2Cache L#5 (size=2048KB linesize=64 ways=16)
         L1iCache L#5 (size=64KB linesize=64 ways=2)
           L1dCache L#10 (size=16KB linesize=64 ways=4)
             Core L#10 (P#2)
               PU L#10 (P#6)
           L1dCache L#11 (size=16KB linesize=64 ways=4)
             Core L#11 (P#3)
               PU L#11 (P#7)
     L3Cache L#3 (size=6144KB linesize=64 ways=64)
       L2Cache L#6 (size=2048KB linesize=64 ways=16)
         L1iCache L#6 (size=64KB linesize=64 ways=2)
           L1dCache L#12 (size=16KB linesize=64 ways=4)
             Core L#12 (P#0)
               PU L#12 (P#12)
           L1dCache L#13 (size=16KB linesize=64 ways=4)
             Core L#13 (P#1)
               PU L#13 (P#13)
       L2Cache L#7 (size=2048KB linesize=64 ways=16)
         L1iCache L#7 (size=64KB linesize=64 ways=2)
           L1dCache L#14 (size=16KB linesize=64 ways=4)
             Core L#14 (P#2)
               PU L#14 (P#14)
           L1dCache L#15 (size=16KB linesize=64 ways=4)
             Core L#15 (P#3)
              PU L#15 (P#15)
depth 0:        1 Machine (type #1)
  depth 1:      2 NUMANode (type #2)
   depth 2:     4 L3Cache (type #4)
    depth 3:    8 L2Cache (type #4)
     depth 4:   8 L1iCache (type #4)
      depth 5:  16 L1dCache (type #4)
       depth 6: 16 Core (type #5)
       depth 7: 16 PU (type #6)
latency matrix between NUMANodes (depth 1) by logical indexes:
   index     0     1
       0 1.000 1.600
       1 1.600 1.000
Topology not from this system
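
For reference, the condition named in the error message can be reproduced by hand from the output above. This is only a sketch of the kind of check involved, not hwloc's actual code: it treats each cpuset as a bitmask, takes the Socket P#0 cpuset (0x000000ff) from the error message, and rebuilds the NUMANode L#0 cpuset from the PU P# numbers listed under it in bro127.output. The helper names `cpuset` and `relation` are made up for illustration.

```python
# Cpusets as bitmasks: bit i set means PU P#i belongs to the object.
def cpuset(pus):
    """Build a bitmask from a list of physical PU indexes."""
    mask = 0
    for pu in pus:
        mask |= 1 << pu
    return mask


def relation(a, b):
    """Classify how two cpusets relate to each other."""
    common = a & b
    if common == 0:
        return "disjoint"
    if common == a or common == b:
        return "inclusion"
    return "intersection without inclusion"


# Socket P#0 cpuset, as printed in the hwloc error message.
socket0 = 0x000000FF

# PUs listed under NUMANode L#0 in bro127.output: P#0-3 and P#8-11.
numa0 = cpuset([0, 1, 2, 3, 8, 9, 10, 11])

print(hex(socket0), hex(numa0))
print(relation(socket0, numa0))
```

The socket and the NUMA node share PUs 0-3, but neither set contains the other, which matches the wording of the error.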

3) SRAT dmesg output was mentioned in another, similar ticket
http://www.open-mpi.org/community/lists/hwloc-users/2012/05/0639.php
so I am including ours here as well ...

[roberpj@bro127:~] dmesg | grep SRAT
ACPI: SRAT 00000000dfdba570 001D0 (v02 AMD    AGESA    00000001 AMD 00000001)
SRAT:  PXM 0 -> APIC 32 -> Node 0
SRAT:  PXM 0 -> APIC 33 -> Node 0
SRAT:  PXM 0 -> APIC 34 -> Node 0
SRAT:  PXM 0 -> APIC 35 -> Node 0
SRAT:  PXM 1 -> APIC 36 -> Node 1
SRAT:  PXM 1 -> APIC 37 -> Node 1
SRAT:  PXM 1 -> APIC 38 -> Node 1
SRAT:  PXM 1 -> APIC 39 -> Node 1
SRAT:  PXM 2 -> APIC 64 -> Node 2
SRAT:  PXM 2 -> APIC 65 -> Node 2
SRAT:  PXM 2 -> APIC 66 -> Node 2
SRAT:  PXM 2 -> APIC 67 -> Node 2
SRAT:  PXM 3 -> APIC 68 -> Node 3
SRAT:  PXM 3 -> APIC 69 -> Node 3
SRAT:  PXM 3 -> APIC 70 -> Node 3
SRAT:  PXM 3 -> APIC 71 -> Node 3
SRAT:  Node 0 PXM 0 0-a0000
SRAT:  Node 0 PXM 0 100000-e0000000
SRAT:  Node 0 PXM 0 100000000-820000000
SRAT:  Node 1 PXM 1 820000000-1020000000
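
To make the SRAT lines easier to compare with the hwloc output, here is a small sketch that tallies which nodes have CPUs (APIC IDs) assigned and which have memory ranges. It assumes the two line formats shown above and that the memory ranges are hexadecimal byte addresses; the regular expressions and variable names are illustrative only.

```python
import re

# SRAT lines copied from the dmesg output above.
SRAT = """\
SRAT:  PXM 0 -> APIC 32 -> Node 0
SRAT:  PXM 0 -> APIC 33 -> Node 0
SRAT:  PXM 0 -> APIC 34 -> Node 0
SRAT:  PXM 0 -> APIC 35 -> Node 0
SRAT:  PXM 1 -> APIC 36 -> Node 1
SRAT:  PXM 1 -> APIC 37 -> Node 1
SRAT:  PXM 1 -> APIC 38 -> Node 1
SRAT:  PXM 1 -> APIC 39 -> Node 1
SRAT:  PXM 2 -> APIC 64 -> Node 2
SRAT:  PXM 2 -> APIC 65 -> Node 2
SRAT:  PXM 2 -> APIC 66 -> Node 2
SRAT:  PXM 2 -> APIC 67 -> Node 2
SRAT:  PXM 3 -> APIC 68 -> Node 3
SRAT:  PXM 3 -> APIC 69 -> Node 3
SRAT:  PXM 3 -> APIC 70 -> Node 3
SRAT:  PXM 3 -> APIC 71 -> Node 3
SRAT:  Node 0 PXM 0 0-a0000
SRAT:  Node 0 PXM 0 100000-e0000000
SRAT:  Node 0 PXM 0 100000000-820000000
SRAT:  Node 1 PXM 1 820000000-1020000000
"""

cpu_re = re.compile(r"SRAT:\s+PXM (\d+) -> APIC (\d+) -> Node (\d+)")
mem_re = re.compile(r"SRAT:\s+Node (\d+) PXM (\d+) ([0-9a-f]+)-([0-9a-f]+)")

apics_per_node = {}      # node -> list of APIC IDs
mem_bytes_per_node = {}  # node -> total bytes covered by its ranges

for line in SRAT.splitlines():
    m = cpu_re.match(line)
    if m:
        node = int(m.group(3))
        apics_per_node.setdefault(node, []).append(int(m.group(2)))
        continue
    m = mem_re.match(line)
    if m:
        node = int(m.group(1))
        start, end = int(m.group(3), 16), int(m.group(4), 16)
        mem_bytes_per_node[node] = mem_bytes_per_node.get(node, 0) + (end - start)

cpu_nodes = sorted(apics_per_node)
mem_nodes = sorted(mem_bytes_per_node)
memoryless = [n for n in cpu_nodes if n not in mem_bytes_per_node]

print("nodes with CPUs:         ", cpu_nodes)
print("nodes with memory:       ", mem_nodes)
print("CPU nodes without memory:", memoryless)
```

On these lines it reports CPUs on nodes 0 through 3 but memory ranges only on nodes 0 and 1, which lines up with hwloc showing just two NUMANodes. Whether that mismatch is the cause of the error is not something this output alone can confirm.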

4) Note the nodes have a 10GE interface on eth2 ...

[root@bro127:~] nano /var/log/messages  (snip)
Jan 15 16:03:55 bro127 kernel: ADDRCONF(NETDEV_UP): eth2: link is not ready
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: changing MTU from 1500 to 8000
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: detected SFP+: 3
Jan 15 16:03:55 bro127 kernel: SoftIWARP attached
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: detected SFP+: 3
Jan 15 16:03:55 bro127 kernel: ixgbe 0000:04:00.0: eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
Jan 15 16:03:55 bro127 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready

[roberpj@bro127:~] modinfo ixgbe
filename: /lib/modules/2.6.32-279.5.2.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko
version:        3.6.7-k
license:        GPL
description:    Intel(R) 10 Gigabit PCI Express Network Driver
author:         Intel Corporation, <linux.n...@intel.com>
srcversion:     EC64C3345C7AC6AB4BD6F5C
alias: pci: v00008086d0000154Asv*sd*bc*sc*i*
alias: pci: v00008086d00001557sv*sd*bc*sc*i*
alias: pci: v00008086d0000154Fsv*sd*bc*sc*i*
alias: pci: v00008086d0000154Dsv*sd*bc*sc*i*
alias: pci: v00008086d00001528sv*sd*bc*sc*i*
alias: pci: v00008086d000010F8sv*sd*bc*sc*i*
alias: pci: v00008086d0000151Csv*sd*bc*sc*i*
alias: pci: v00008086d00001529sv*sd*bc*sc*i*
alias: pci: v00008086d0000152Asv*sd*bc*sc*i*
alias: pci: v00008086d000010F9sv*sd*bc*sc*i*
alias: pci: v00008086d00001514sv*sd*bc*sc*i*
alias: pci: v00008086d00001507sv*sd*bc*sc*i*
alias: pci: v00008086d000010FBsv*sd*bc*sc*i*
alias: pci: v00008086d00001517sv*sd*bc*sc*i*
alias: pci: v00008086d000010FCsv*sd*bc*sc*i*
alias: pci: v00008086d000010F7sv*sd*bc*sc*i*
alias: pci: v00008086d00001508sv*sd*bc*sc*i*
alias: pci: v00008086d000010DBsv*sd*bc*sc*i*
alias: pci: v00008086d000010F4sv*sd*bc*sc*i*
alias: pci: v00008086d000010E1sv*sd*bc*sc*i*
alias: pci: v00008086d000010F1sv*sd*bc*sc*i*
alias: pci: v00008086d000010ECsv*sd*bc*sc*i*
alias: pci: v00008086d000010DDsv*sd*bc*sc*i*
alias: pci: v00008086d0000150Bsv*sd*bc*sc*i*
alias: pci: v00008086d000010C8sv*sd*bc*sc*i*
alias: pci: v00008086d000010C7sv*sd*bc*sc*i*
alias: pci: v00008086d000010C6sv*sd*bc*sc*i*
alias: pci: v00008086d000010B6sv*sd*bc*sc*i*
depends:        mdio,dca
vermagic:       2.6.32-279.5.2.el6.x86_64 SMP mod_unload modversions
parm: IntMode:Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2 (array of int)
parm: FdirMode:Flow Director filtering modes (0=Off, 1=Hashing) default 1 (array of int)
parm: max_vfs:Maximum number of virtual functions to allocate per physical function (uint)
parm: allow_unsupported_sfp:Allow unsupported and untested SFP+ modules on 82599-based adapters (uint)

Attachment: bro127.tar.bz2
Description: BZip2 compressed data
