Re: Advice needed on SMP regression after cpu_core_mask change

Daniel Henrique Barboza Wed, 17 Mar 2021 09:05:54 -0700



On 3/17/21 12:30 PM, Cédric Le Goater wrote:

On 3/17/21 2:00 PM, Daniel Henrique Barboza wrote:

Hello,

Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") introduced
a regression in both upstream and RHEL downstream kernels [1]. The assumption 
made
in the commit:

"Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be
equal on Power"

Doesn't seem to be true. After this commit, QEMU is now unable to set single 
NUMA
node SMP topologies such as:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2

lscpu will give the following output in this case:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7


This is happening because the macro cpu_cpu_mask(cpu) expands to
cpumask_of_node(cpu_to_node(cpu)), which in turn expands to 
node_to_cpumask_map[node].
node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is on 
CC to
correct me if I'm wrong). We're now associating sockets to NUMA nodes directly.

If I add a second NUMA node then I can get the intended smp topology:

-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7


However, if I try a single socket with multiple NUMA nodes topology, which is 
the case
of Power10, e.g.:


-smp 8,maxcpus=8,cores=4,threads=2,sockets=1
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \


This is the result:

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-3
NUMA node1 CPU(s):   4-7


This confirms my suspicions that, at this moment, we're making sockets == NUMA 
nodes.


Yes. I don't think we can do better on PAPR and the above examples
seem to confirm that the "sockets" definition is simply ignored.

Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. The 
commit
after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: Stop 
updating
cpu_core_mask"), is erasing the code that calculated cpu_core_mask. 
cpu_core_mask, despite
its shortcomings that caused its removal, was giving a precise SMP topology. 
And it was
using physical_package_id/'ibm,chip-id' for that.


ibm,chip-id is a no-no on pSeries. I guess this is inherent to PAPR which
is hiding a lot of the underlying HW and topology. May be we are trying
to reconcile two orthogonal views of machine virtualization ...

Checking in QEMU I can say that the ibm,chip-id calculation is the only place 
in the code
that cares about cores per socket information. The kernel is now ignoring that, 
starting
on 4bce545903fa, and now QEMU is unable to provide this info to the guest.

If we're not going to use ibm,chip-id any longer, which seems sensible given 
that PAPR does
not declare it, we need another way of letting the guest know how much cores 
per socket
we want.

The RTAS call "ibm,get-system-parameter" with token "Processor Module
Information" returns that kind of information :

   2 byte binary number (N) of module types followed by N module specifiers of 
the form:
   2 byte binary number (M) of sockets of this module type
   2 byte binary number (L) of chips per this module type
   2 byte binary number (K) of cores per chip in this module type.

See the values in these sysfs files :

   cat /sys/devices/hv_24x7/interface/{sockets,chipspersocket,coresperchip}

But I am afraid these are host level information and not guest/LPAR.



I believe there might be some sort of reasoning behind not having this on
PAPR, but I'll say in advance that the virtual machine should act as the
real hardware, as close as possible. This is the kind of hcall that could
be used in this situation.


I didn't find any LPAR level properties or hcalls in the PAPR document.
They need to be specified.

or

We can add extra properties like ibm,chip-id but making sure it's only
used under the KVM hypervisor. My understanding is that's something we
are trying to avoid.


We can change PAPR to add ibm,chip-id. Problem is that ibm,chip-id today, with
the current kernel codebase, does not fix the issue because the code is
ignoring it hehehe


If we're going to change PAPR -  and I believe we should, there's a clear
lack of proper support for SMP topologies - we're better make sure that whatever
attribute/hcall we add there fixes it in a robust way for the long term.


Thanks,


DHB

C.

Re: Advice needed on SMP regression after cpu_core_mask change

Reply via email to