Srikar Dronamraju <sri...@linux.vnet.ibm.com> writes:

> * Aneesh Kumar K.V <aneesh.ku...@linux.ibm.com> [2020-08-17 17:04:24]:
>
>> On 8/17/20 4:29 PM, Srikar Dronamraju wrote:
>> > * Aneesh Kumar K.V <aneesh.ku...@linux.ibm.com> [2020-08-17 16:02:36]:
>> >
>> > > We use ibm,associativity and ibm,associativity-lookup-arrays to
>> > > derive the numa node numbers. These device tree properties are
>> > > firmware indicated groupings of resources based on their hierarchy
>> > > in the platform. These numbers (group ids) are not sequential, and
>> > > the hypervisor/firmware can follow different numbering schemes.
>> > > For example, on powernv platforms we group them in the below order.
>> > >
>> > >  * - CCM node ID
>> > >  * - HW card ID
>> > >  * - HW module ID
>> > >  * - Chip ID
>> > >  * - Core ID
>> > >
>> > > Based on ibm,associativity-reference-points we use one of the above
>> > > group ids as the Linux NUMA node id. (On the PowerNV platform, Chip
>> > > ID is used.) This results in Linux reporting non-linear NUMA node
>> > > ids, and it can also result in Linux reporting an empty node 0.
>> > >
>> > > This can be resolved by mapping the firmware provided group id to a
>> > > logical Linux NUMA id. In this patch, we do this only for pseries
>> > > platforms, since the firmware group id is a virtualized entity and
>> > > users would not have drawn any conclusions based on the Linux NUMA
>> > > node id.
>> > >
>> > > On the PowerNV platform, since we have historically mapped Chip ID
>> > > as the Linux NUMA node id, we keep the existing Linux NUMA node id
>> > > numbering.
>> >
>> > I still don't understand how you are going to handle numa distances.
>> > With your patch, have you tried dlpar add/remove on a sparsely noded
>> > machine?
>> >
>>
>> We follow the same steps when fetching distance information. Instead of
>> using the affinity domain id, we now use the mapped node id. The
>> relevant hunk in the patch is
>>
>> +	nid = affinity_domain_to_nid(&domain);
>>
>>  	if (nid > 0 &&
>> -		of_read_number(associativity, 1) >= distance_ref_points_depth) {
>> +	    of_read_number(associativity, 1) >= distance_ref_points_depth) {
>>  		/*
>>  		 * Skip the length field and send start of associativity array
>>  		 */
>>
>> I haven't tried dlpar add/remove. I don't have a setup to try that. Do
>> you see a problem there?
>>
>
> Yes, I think there can be 2 problems.
>
> 1. The distance table may be filled with incorrect data.
> 2. The numactl -H distance table shows symmetric data; the symmetric
>    nature may be lost.
>
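For context, the affinity_domain_to_nid() used in the hunk above boils
down to a first-touch lookup table: the first firmware domain id we see
gets logical node 0, the next new one gets node 1, and so on. Below is a
minimal standalone sketch of that idea. The names mirror the hunk, but
the array sizing, the lack of locking and the other details here are
simplified illustrations, not the actual patch:

#define MAX_NUMNODES	256
#define NUMA_NO_NODE	(-1)

struct affinity_domain {
	int id;			/* firmware provided group id */
};

/* domain_id_map[nid] holds the firmware domain id owning logical nid */
static int domain_id_map[MAX_NUMNODES];
static int last_nid;

static int affinity_domain_to_nid(struct affinity_domain *domain)
{
	int nid;

	/* Reuse the existing mapping if this domain was seen before. */
	for (nid = 0; nid < last_nid; nid++)
		if (domain_id_map[nid] == domain->id)
			return nid;

	/* Otherwise hand out the next free logical node id. */
	if (last_nid >= MAX_NUMNODES)
		return NUMA_NO_NODE;

	domain_id_map[last_nid] = domain->id;
	return last_nid++;
}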
After discussing with Srikar to understand these concerns better, below
are the conclusions.

1) There is no corruption of the node distance table. We do handle node
   distances correctly. But the numactl -H output can be such that numa
   nodes with a higher number are not necessarily further away from
   node 0, ie, we can find output like the below.

   node   0   1   2   3
     0:  10  40  40  20
     1:  40  10  40  40
     2:  40  40  10  40
     3:  20  40  40  10

   Here node 3 is closer to node 0 than nodes 1 and 2. I am not sure
   this is going to break any userspace.

2) We can find node numbers changing if we do a DLPAR add of memory/cpu
   and then reboot. ie, if we boot with resource domain ids 4 and 6 and
   later add resources from domain 5, nodes 0, 1 and 2 will map to
   domain ids 4, 6 and 5. On reboot, we can instead map them such that

   node 0 -> 4
   node 1 -> 5
   node 2 -> 6

   (a small walk-through of this follows below). I guess this is still
   ok, because we are running in a virtualized environment and node
   number to domain id mappings are never guaranteed to be the same
   across reboots.

-aneesh
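PS: To make the reboot scenario in 2) concrete, here is a hypothetical
walk-through reusing the affinity_domain_to_nid() sketch earlier in this
mail (paste both snippets into one file to try it; the domain ids and
the DLPAR ordering are made up for illustration):

#include <stdio.h>

int main(void)
{
	struct affinity_domain d4 = { 4 }, d5 = { 5 }, d6 = { 6 };

	/* First boot: domains 4 and 6 are present, 5 arrives via DLPAR. */
	printf("domain 4 -> node %d\n", affinity_domain_to_nid(&d4)); /* 0 */
	printf("domain 6 -> node %d\n", affinity_domain_to_nid(&d6)); /* 1 */
	printf("domain 5 -> node %d\n", affinity_domain_to_nid(&d5)); /* 2 */

	/*
	 * After a reboot all three domains are discovered in numeric
	 * order, so the same code would give 4 -> 0, 5 -> 1, 6 -> 2:
	 * domain 5 moves from node 2 to node 1.
	 */
	return 0;
}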