Hmm ... the Slurm config defines that all nodes have 4 sockets with 16 cores
per socket (which corresponds to the hardware--all nodes are the same).  The
Slurm node config is as follows:

NodeName=n[001-008] RealMemory=258452 Sockets=4 CoresPerSocket=16 
ThreadsPerCore=1 State=UNKNOWN Port=[17001-17008]

But we get this error--so I suspect it's a parsing error on the Slurm side?

May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from hardware: 
CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw) CoresPerSocket=16:8(hw) 
ThreadsPerCore=1:1(hw)
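
For reference, one way to cross-check this (a sketch; it assumes a Slurm
version whose slurmd supports the -C option and a slurm.conf under /etc/slurm
-- adjust the path for your install) is to compare what slurmd itself derives
from hwloc against the NodeName line above:

  # on the complaining node (n001): print the hardware config slurmd detects via hwloc
  slurmd -C
  # the node definition Slurm compares it against
  grep '^NodeName' /etc/slurm/slurm.conf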

Craig

On Wednesday, May 28, 2014 3:20 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
 


On 28/05/2014 14:13, Craig Kapfer wrote:

>Interesting, quite right, thank you very much.  Yes, these are AMD 6300 series.
>Same kernel, but these boxes seem to have different BIOS versions, direct from
>the factory, delivered in the same physical enclosure even!  Some are AMI 3.5
>and some are 3.0.
>
>So Slurm is then incorrectly parsing correct output from lstopo to generate
>this message?
>
>May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from hardware:
>CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw) CoresPerSocket=16:8(hw)
>ThreadsPerCore=1:1(hw)
It's saying "there are 8 sockets with 8 cores each in hw, instead of 4 sockets
with 16 cores each in the config"?

My feeling is that Slurm just has a (valid) config that matches group 1 while
it was running on a group 2 node in this case.

Brice

>Thanks much,
>Craig
>
>On Wednesday, May 28, 2014 1:39 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>
>Aside from the BIOS config, are you sure that you have the exact same BIOS
>*version* on each node? (You can check in /sys/class/dmi/id/bios_*.) Same Linux
>kernel too?
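>
>A quick way to compare this across all nodes (a sketch; it assumes pdsh is
>installed and that the kernel exposes the usual DMI files -- adjust the host
>list to match your cluster):
>
>  # BIOS vendor/version/date as exposed by the kernel on each node
>  pdsh -w n[001-008] 'grep . /sys/class/dmi/id/bios_*'
>  # kernel version, to rule out a mixed boot image
>  pdsh -w n[001-008] 'uname -r'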
>
>Also, recently we've seen somebody fix such problems by unplugging and
>replugging some CPUs on the motherboard. Seems crazy but it happened for
>real...
>
>By the way, your discussion of groups 1 and 2 below is wrong. Group 2 doesn't
>say that NUMA node == socket, and it doesn't report 8 sockets of 8 cores each.
>It reports 4 sockets, each containing 2 NUMA nodes of 8 cores each, and that's
>likely what you have here (AMD Opteron 6300 or 6200 processors?).
>
>Brice
>
>On 28/05/2014 12:27, Craig Kapfer wrote:
>
>>We have a bunch of 64-core (quad-socket, 16 cores/socket) AMD servers and some
>>of them are reporting the following error from Slurm, which I gather gets its
>>info from hwloc:
>>May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from hardware: 
>>CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw) CoresPerSocket=16:8(hw) 
>>ThreadsPerCore=1:1(hw)
>>
>>All nodes have the exact same CPUs, motherboards and OS (PXE booted from the
>>same master image even).  The BIOS settings between nodes also look the same.
>>The nodes only differ in the amount of memory and number of DIMMs.
>>
>>There are two sets of nodes with different output from lstopo:
>>
>>Group 1 (correct): reporting 4 sockets with 16 cores per socket
>>Group 2 (incorrect): reporting 8 sockets with 8 cores per socket
>>
>>Group 2 seems to be (incorrectly?) taking NUMA nodes as sockets. The output of
>>lstopo is slightly different in the two groups; note the extra Socket layer
>>for group 2:
>>
>>Group 1:
>>Machine (128GB)
>>  NUMANode L#0 (P#0 32GB) + Socket L#0
>>    # 16 cores listed
>>  <snip>
>>  NUMANode L#1 (P#2 32GB) + Socket L#1
>>    # 16 cores listed
>>  etc
>><snip>
>>
>>Group 2:
>>Machine (256GB)
>>  Socket L#0 (64GB)
>>    NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
>>      # 8 cores listed
>>    <snip>
>>    NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
>>      # 8 cores listed
>>    <snip>
>>  Socket L#1 (64GB)
>>    NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
>>      # 8 cores listed
>>  etc
>><snip>
>>
>>The group 2 reporting doesn't match our hardware, at least as far as sockets
>>and cores per socket go--is there a reason other than the memory
>>configuration that could cause this?
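>>
>>For comparison, here is how the output above can be captured on one node from
>>each group and diffed (a sketch; it assumes hwloc's standard command-line
>>tools and numactl are installed):
>>
>>  lstopo --of console > /tmp/topo-$(hostname).txt  # full text topology as hwloc sees it
>>  grep . /sys/devices/system/cpu/cpu0/topology/*   # raw package/core IDs that hwloc reads
>>  numactl --hardware                               # NUMA layout as the kernel reports it
>>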
>>Thanks,
>>Craig