Aside from the BIOS config, are you sure that you have the exact same BIOS
*version* on each node? (You can check /sys/class/dmi/id/bios_*.) Same
Linux kernel too?
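
For comparing nodes, something like the following sketch prints the BIOS
identification strings and the kernel release in a form you can diff across
machines (the exact set of bios_* files present may vary by platform; this
snippet is just an illustration, not from hwloc itself):

```shell
# Dump BIOS vendor/version/date plus kernel release, one "key: value" per line,
# so the output can be diffed between nodes. Missing files are skipped silently.
for f in /sys/class/dmi/id/bios_vendor \
         /sys/class/dmi/id/bios_version \
         /sys/class/dmi/id/bios_date; do
    [ -r "$f" ] && printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
done
printf 'kernel: %s\n' "$(uname -r)"
```

Run it on one node from each group and diff the outputs; any difference in
BIOS version or kernel is a likely suspect for the topology discrepancy.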

Also, we recently saw somebody fix a problem like this by unplugging and
replugging some CPUs on the motherboard. It sounds crazy, but it happened
for real...

By the way, your description of groups 1 and 2 below is wrong. Group 2
doesn't say that NUMA node == socket, and it doesn't report 8 sockets of
8 cores each: it reports 4 sockets, each containing 2 NUMA nodes of 8
cores, and that's likely what you actually have here (AMD Opteron 6300
or 6200 processors?).

Brice



On 28/05/2014 12:27, Craig Kapfer wrote:
> We have a bunch of 64-core (quad-socket, 16 cores/socket) AMD servers and 
> some of them are reporting the following error from slurm, which I gather 
> gets its info from hwloc:
>
>     May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from 
> hardware: CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw) 
> CoresPerSocket=16:8(hw) ThreadsPerCore=1:1(hw)
>
> All nodes have the exact same CPUs, motherboards and OS (PXE booted from the 
> same master image even).  The bios settings between nodes also look the same. 
>  The nodes only differ in the amount of memory and number of DIMMs.  
> There are two sets of nodes with different output from lstopo:
>
> Group 1 (correct): reporting 4 sockets with 16 cores per socket
> Group 2 (incorrect): reporting 8 sockets with 8 cores per socket
>
> Group 2 seems to be (incorrectly?) taking numanodes as sockets.
>
> The output of lstopo is slightly different in the two groups, note the extra 
> Socket layer for group 2:
>
> Group 1: 
> Machine (128GB)
>   NUMANode L#0 (P#0 32GB) + Socket L#0
>   #16 cores listed
>   <snip>
>   NUMANode L#1 (P#2 32GB) + Socket L#1
>   #16 cores listed
>   etc
> <snip>
>
> Group 2:
> Machine (256GB)
>   Socket L#0 (64GB)
>     NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
>     # 8 cores listed
>     <snip>
>     NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
>     # 8 cores listed
>     <snip>
>   Socket L#1 (64GB)
>     NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
>     # 8 cores listed
>     etc
> <snip>
>
> The group 2 reporting doesn't match our hardware, at least as far as sockets 
> and cores per socket goes--is there a reason other than the memory 
> configuration that could cause this? 
> Thanks,
> Craig
>
>
>
> _______________________________________________
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
