Interesting, quite right, thank you very much.  Yes, these are AMD 6300-series 
processors.  Same kernel, but these boxes turn out to have different BIOS 
versions direct from the factory, even though they were delivered in the same 
physical enclosure!  Some are AMI 3.5 and some are 3.0.

So is slurm then incorrectly parsing correct output from lstopo when it 
generates this message?

May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from hardware: 
CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw) CoresPerSocket=16:8(hw) 
ThreadsPerCore=1:1(hw)
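
(As I understand that message, the number before each colon is what we put in 
slurm.conf and the "N(hw)" value is what slurmd detected via hwloc.  Our node 
definition is roughly the hypothetical line below, with the real host names 
elided:)

# hypothetical slurm.conf entry; actual NodeName/host range differs
NodeName=n001 Boards=1 SocketsPerBoard=4 CoresPerSocket=16 ThreadsPerCore=1 CPUs=64
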
Thanks much,


Craig


On Wednesday, May 28, 2014 1:39 PM, Brice Goglin <brice.gog...@inria.fr> wrote:

Aside from the BIOS config, are you sure that you have the exact same BIOS 
*version* in each node? (You can check in /sys/class/dmi/id/bios_*.) Same 
Linux kernel too?
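
If you want to script that check across nodes, something like this small C 
program (plain sysfs reads, nothing hwloc-specific; an untested sketch) dumps 
the interesting fields so you can diff the output between machines:

/* Sketch: print the DMI BIOS fields to compare across nodes.
 * These sysfs paths are standard on Linux. */
#include <stdio.h>

int main(void)
{
    const char *files[] = {
        "/sys/class/dmi/id/bios_vendor",
        "/sys/class/dmi/id/bios_version",
        "/sys/class/dmi/id/bios_date",
    };
    char buf[128];
    for (int i = 0; i < 3; i++) {
        FILE *f = fopen(files[i], "r");
        if (!f)
            continue;  /* a field may be missing on some systems */
        if (fgets(buf, sizeof buf, f))
            printf("%s: %s", files[i], buf);  /* buf keeps its newline */
        fclose(f);
    }
    return 0;
}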

Also, recently we've seen somebody fix such problems by unplugging and 
replugging some CPUs on the motherboard. Seems crazy, but it happened for 
real...

By the way, your discussion of groups 1 and 2 below is wrong. Group 2 doesn't 
say that NUMA node == socket, and it doesn't report 8 sockets of 8 cores each. 
It reports 4 sockets, each containing 2 NUMA nodes of 8 cores, and that's 
likely what you have here (AMD Opteron 6300 or 6200 processors?).
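
If you want to double-check what hwloc itself sees, independently of slurm's 
parsing, a minimal program against the hwloc 1.x API (untested sketch, compile 
with -lhwloc) prints the object counts directly:

/* Sketch: count sockets, NUMA nodes and cores as hwloc sees them.
 * Build with: gcc count.c -o count -lhwloc */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    printf("sockets:    %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET));
    printf("NUMA nodes: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NODE));
    printf("cores:      %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));

    hwloc_topology_destroy(topo);
    return 0;
}

If that prints the same core and NUMA-node counts on both groups of nodes, 
then only the socket detection differs, which points at what the BIOS exports 
rather than at the rest of the topology.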

Brice




Le 28/05/2014 12:27, Craig Kapfer a écrit :

We have a bunch of 64-core (quad-socket, 16 cores/socket) AMD servers and some 
of them are reporting the following error from slurm, which I gather gets its 
info from hwloc: 
May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from hardware: 
CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw) CoresPerSocket=16:8(hw) 
ThreadsPerCore=1:1(hw)

All nodes have the exact same CPUs, motherboards and OS (PXE booted from the 
same master image even).  The BIOS settings between nodes also look the same.  
The nodes only differ in the amount of memory and number of DIMMs.
There are two sets of nodes with different output from lstopo:

Group 1 (correct): reporting 4 sockets with 16 cores per socket
Group 2 (incorrect): reporting 8 sockets with 8 cores per socket

Group 2 seems to be (incorrectly?) taking NUMA nodes as sockets. The output of 
lstopo is slightly different in the two groups; note the extra Socket layer 
for group 2:

Group 1:

Machine (128GB)
  NUMANode L#0 (P#0 32GB) + Socket L#0
    # 16 cores listed
    <snip>
  NUMANode L#1 (P#2 32GB) + Socket L#1
    # 16 cores listed
    etc.
    <snip>

Group 2:

Machine (256GB)
  Socket L#0 (64GB)
    NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
      # 8 cores listed
      <snip>
    NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
      # 8 cores listed
      <snip>
  Socket L#1 (64GB)
    NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
      # 8 cores listed
      etc.
      <snip>

The group 2 reporting doesn't match our hardware, at least as far as sockets 
and cores per socket go--is there a reason other than the memory configuration 
that could cause this?
Thanks,
Craig