On 3/17/21 2:00 PM, Daniel Henrique Barboza wrote: > Hello, > > Patch 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") > introduced > a regression in both upstream and RHEL downstream kernels [1]. The assumption > made > in the commit: > > "Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would > be > equal on Power" > > Doesn't seem to be true. After this commit, QEMU is now unable to set single > NUMA > node SMP topologies such as: > > -smp 8,maxcpus=8,cores=2,threads=2,sockets=2 > > lscpu will give the following output in this case: > > # lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 8 > On-line CPU(s) list: 0-7 > Thread(s) per core: 2 > Core(s) per socket: 4 > Socket(s): 1 > NUMA node(s): 1 > Model: 2.2 (pvr 004e 1202) > Model name: POWER9 (architected), altivec supported > Hypervisor vendor: KVM > Virtualization type: para > L1d cache: 32K > L1i cache: 32K > NUMA node0 CPU(s): 0-7 > > > This is happening because the macro cpu_cpu_mask(cpu) expands to > cpumask_of_node(cpu_to_node(cpu)), which in turn expands to > node_to_cpumask_map[node]. > node_to_cpumask_map is a NUMA array that maps CPUs to NUMA nodes (Aneesh is > on CC to > correct me if I'm wrong). We're now associating sockets to NUMA nodes > directly. > > If I add a second NUMA node then I can get the intended smp topology: > > -smp 8,maxcpus=8,cores=2,threads=2,sockets=2 > -numa node,memdev=mem0,cpus=0-3,nodeid=0 \ > -numa node,memdev=mem1,cpus=4-7,nodeid=1 \ > > # lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 8 > On-line CPU(s) list: 0-7 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 2 > NUMA node(s): 2 > Model: 2.2 (pvr 004e 1202) > Model name: POWER9 (architected), altivec supported > Hypervisor vendor: KVM > Virtualization type: para > L1d cache: 32K > L1i cache: 32K > NUMA node0 CPU(s): 0-3 > NUMA node1 CPU(s): 4-7 > > > However, if I try a single socket with multiple NUMA nodes topology, which is > the case > of Power10, e.g.: > > > -smp 8,maxcpus=8,cores=4,threads=2,sockets=1 > -numa node,memdev=mem0,cpus=0-3,nodeid=0 \ > -numa node,memdev=mem1,cpus=4-7,nodeid=1 \ > > > This is the result: > > # lscpu > Architecture: ppc64le > Byte Order: Little Endian > CPU(s): 8 > On-line CPU(s) list: 0-7 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 2 > NUMA node(s): 2 > Model: 2.2 (pvr 004e 1202) > Model name: POWER9 (architected), altivec supported > Hypervisor vendor: KVM > Virtualization type: para > L1d cache: 32K > L1i cache: 32K > NUMA node0 CPU(s): 0-3 > NUMA node1 CPU(s): 4-7 > > > This confirms my suspicions that, at this moment, we're making sockets == > NUMA nodes.
Yes. I don't think we can do better on PAPR and the above examples seem to confirm that the "sockets" definition is simply ignored. > Cedric, the reason I'm CCing you is because this is related to ibm,chip-id. > The commit > after the one that caused the regression, 4ca234a9cbd7c3a65 ("powerpc/smp: > Stop updating > cpu_core_mask"), is erasing the code that calculated cpu_core_mask. > cpu_core_mask, despite > its shortcomings that caused its removal, was giving a precise SMP topology. > And it was > using physical_package_id/'ibm,chip-id' for that. ibm,chip-id is a no-no on pSeries. I guess this is inherent to PAPR which is hiding a lot of the underlying HW and topology. May be we are trying to reconcile two orthogonal views of machine virtualization ... > Checking in QEMU I can say that the ibm,chip-id calculation is the only place > in the code > that cares about cores per socket information. The kernel is now ignoring > that, starting > on 4bce545903fa, and now QEMU is unable to provide this info to the guest. > > If we're not going to use ibm,chip-id any longer, which seems sensible given > that PAPR does > not declare it, we need another way of letting the guest know how much cores > per socket > we want. The RTAS call "ibm,get-system-parameter" with token "Processor Module Information" returns that kind of information : 2 byte binary number (N) of module types followed by N module specifiers of the form: 2 byte binary number (M) of sockets of this module type 2 byte binary number (L) of chips per this module type 2 byte binary number (K) of cores per chip in this module type. See the values in these sysfs files : cat /sys/devices/hv_24x7/interface/{sockets,chipspersocket,coresperchip} But I am afraid these are host level information and not guest/LPAR. I didn't find any LPAR level properties or hcalls in the PAPR document. They need to be specified. or We can add extra properties like ibm,chip-id but making sure it's only used under the KVM hypervisor. My understanding is that's something we are trying to avoid. C.