On Tue, 31 Aug 2021 11:13:45 -0300
Daniel Henrique Barboza <danielhb...@gmail.com> wrote:
> 
> On 8/31/21 10:40 AM, Greg Kurz wrote:
> > On Fri, 27 Aug 2021 06:24:53 -0300
> > Daniel Henrique Barboza <danielhb...@gmail.com> wrote:
> > 
> >> The main feature of FORM2 affinity support is the separation of NUMA
> >> distances from the ibm,associativity information. This allows for a
> >> more flexible and straightforward NUMA distance assignment, without
> >> relying on complex associations between several levels of NUMA via
> >> ibm,associativity matches. Another feature is its extensibility: this
> >> base support contains the facilities for NUMA distance assignment,
> >> and in the future more facilities will be added for latency,
> >> performance, bandwidth and so on.
> >> 
> >> This patch implements the base FORM2 affinity support as follows:
> >> 
> >> - the use of FORM2 associativity is indicated by bit 2 of byte 5 of
> >>   ibm,architecture-vec-5. A FORM2-aware guest can choose to use FORM1
> >>   or FORM2 affinity. Setting both forms will default to FORM2. We're
> >>   not advertising FORM2 for pseries-6.1 and older machine versions to
> >>   prevent guest-visible changes in those;
> >> 
> >> - move spapr_numa_associativity_init() from spapr_machine_init() to
> >>   do_client_architecture_support(), when we already know which NUMA
> >>   affinity the guest will use. This avoids initializing the FORM1
> >>   spapr->numa_assoc_array only to overwrite it shortly afterwards if
> >>   FORM2 is chosen;
> >> 
> > 
> > As I was saying in another mail, I'm not very comfortable with
> > delaying spapr_numa_associativity_init() to CAS until I'm sure
> > nothing else assumes it has already been done when the machine first
> > resets and boots.
> > 
> > Also, these are slow paths and I'd rather overwrite the array if that
> > keeps the code simple. FWIW this is what we already do with the irq
> > backend: we always reset to XICS and switch to XIVE at CAS.
> > 
> > This also makes me think that spapr_numa_associativity_init() should
> > be called during machine reset, so that spapr->numa_assoc_array is
> > reset to a known default value, i.e. FORM1 or legacy depending on the
> > machine version. Maybe rename it to spapr_numa_associativity_reset()
> > for clarity?
> 
> It is not tragic to delay spapr_numa_associativity_init() up until CAS
> because we haven't written any ibm,associativity arrays yet. What we
> write before CAS are the common RTAS artifacts, such as
> reference-points and maxdomains, but as you suggested in the previous
> review, we're just writing them again if the guest chose to use FORM2.
> 
> If we really want to be safe I can do the following:
> 
> - keep spapr_numa_associativity_init() in machine_init(), like it is
>   today;

I think it should be performed at machine reset actually.

> - call it again after CAS if the guest chooses FORM2;
> 
> - change spapr_numa_associativity_init() to clear the unused FORM1
>   associativity domains if FORM2 is chosen. This step is needed
>   because there's no way to know if we're before or after CAS, and the
>   FORM1 code populates the associativity domains based on user NUMA
>   distance input. For clarity, I want to keep unused stuff zeroed when
>   using FORM2.
> 
> Thanks,
> 
> Daniel
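
To expand on the machine reset idea: we already throw the CAS state
away at reset, so the NUMA default could be restored at the same spot.
Completely untested sketch, in spapr_machine_reset():

    /* IIRC these two lines are already there today */
    spapr_ovec_cleanup(spapr->ov5_cas);
    spapr->ov5_cas = spapr_ovec_new();

    /*
     * Back to the machine default affinity (FORM1, or legacy for
     * older machine types). CAS can then switch to FORM2 if the
     * guest asks for it.
     */
    spapr_numa_associativity_init(spapr, machine);

and the helper could be renamed to spapr_numa_associativity_reset() to
make the intent obvious.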
> > 
> >> - ibm,associativity-reference-points has a new semantic. Instead of
> >>   being used to calculate distances via NUMA levels, it's now used
> >>   to indicate the primary domain index in the ibm,associativity
> >>   array of each resource. In our case it's set to {0x4}, matching
> >>   the position where we already place logical_domain_id;
> >> 
> >> - two new RTAS DT artifacts are introduced:
> >>   ibm,numa-lookup-index-table and ibm,numa-distance-table. The index
> >>   table lists all the NUMA logical domains of the platform, in
> >>   ascending order, and allows for sparse NUMA configurations
> >>   (although QEMU doesn't support those ATM).
> >>   ibm,numa-distance-table is an array that contains all the
> >>   distances from the first NUMA node to all other nodes, then the
> >>   second NUMA node's distances to all other nodes, and so on;
> >> 
> >> - spapr_post_load changes: since we're adding a new NUMA affinity
> >>   that isn't compatible with the existing one, migration must be
> >>   handled accordingly because we can't be certain whether the guest
> >>   went through CAS in the source. The solution chosen is to
> >>   initialize the NUMA associativity data in spapr_post_load()
> >>   unconditionally. The worst case would be to write the DT twice if
> >>   the guest is in the pre-CAS stage. Otherwise, we're making sure
> >>   that a FORM1 guest will have spapr->numa_assoc_array initialized
> >>   with the proper information based on user distances, something
> >>   that is not needed with FORM2.
> >> 
> >> Signed-off-by: Daniel Henrique Barboza <danielhb...@gmail.com>
> >> ---
> >>  hw/ppc/spapr.c              |  27 +++++++-
> >>  hw/ppc/spapr_hcall.c        |   4 ++
> >>  hw/ppc/spapr_numa.c         | 127 +++++++++++++++++++++++++++++++++++-
> >>  include/hw/ppc/spapr.h      |   1 +
> >>  include/hw/ppc/spapr_ovec.h |   1 +
> >>  5 files changed, 156 insertions(+), 4 deletions(-)
> >> 
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index d39fd4e644..a3eb33764d 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -1786,6 +1786,22 @@ static int spapr_post_load(void *opaque, int version_id)
> >>          return err;
> >>      }
> >>  
> >> +    /*
> >> +     * NUMA data init is done at CAS time. There is no reliable
> >> +     * way of telling whether the guest already went through CAS
> >> +     * in the source due to how spapr_ov5_cas_needed works: a
> >> +     * FORM1 guest can be migrated with ov5_cas empty regardless
> >> +     * of going through CAS first.
> >> +     *
> >> +     * One solution is to always call numa_associativity_init. The
> >> +     * downside is that a guest migrated before CAS will run
> >> +     * numa_associativity_init again when going through it, but
> >> +     * at least we're making sure spapr->numa_assoc_array will be
> >> +     * initialized and hotplug operations won't fail, in both the
> >> +     * before and after CAS migration cases.
> > 
> > This seems to comfort my feelings: it is safer to do the init
> > unconditionally to ensure no other code gets upset.
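
To spell out the "hotplug operations won't fail" part: anything on the
destination that ends up in a DT helper reading the array, e.g.
something like

    spapr_numa_write_associativity_dt(spapr, fdt, offset, nodeid);

in a hotplug path, would emit a bogus all-zero ibm,associativity if
spapr->numa_assoc_array had never been initialized there. Doing the
init unconditionally in post load means we don't have to audit every
such caller.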
> > 
> >> +     */
> >> +    spapr_numa_associativity_init(spapr, MACHINE(spapr));
> >> +
> >>      return err;
> >>  }
> >>  
> >> @@ -2752,6 +2768,11 @@ static void spapr_machine_init(MachineState *machine)
> >>  
> >>      spapr_ovec_set(spapr->ov5, OV5_FORM1_AFFINITY);
> >>  
> >> +    /* Do not advertise FORM2 support for pseries-6.1 and older */
> >> +    if (!smc->pre_6_2_numa_affinity) {
> >> +        spapr_ovec_set(spapr->ov5, OV5_FORM2_AFFINITY);
> >> +    }
> >> +
> >>      /* advertise support for dedicated HP event source to guests */
> >>      if (spapr->use_hotplug_event_source) {
> >>          spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
> >> @@ -2808,9 +2829,6 @@ static void spapr_machine_init(MachineState *machine)
> >>  
> >>      spapr->gpu_numa_id = spapr_numa_initial_nvgpu_numa_id(machine);
> >>  
> >> -    /* Init numa_assoc_array */
> >> -    spapr_numa_associativity_init(spapr, machine);
> >> -
> >>      if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
> >>          ppc_type_check_compat(machine->cpu_type,
> >>                                CPU_POWERPC_LOGICAL_3_00, 0,
> >>                                spapr->max_compat_pvr)) {
> >> @@ -4700,8 +4718,11 @@ DEFINE_SPAPR_MACHINE(6_2, "6.2", true);
> >>   */
> >>  static void spapr_machine_6_1_class_options(MachineClass *mc)
> >>  {
> >> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> >> +
> >>      spapr_machine_6_2_class_options(mc);
> >>      compat_props_add(mc->compat_props, hw_compat_6_1, hw_compat_6_1_len);
> >> +    smc->pre_6_2_numa_affinity = true;
> >>  }
> >>  
> >>  DEFINE_SPAPR_MACHINE(6_1, "6.1", false);
> >> 
> >> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> >> index 0e9a5b2e40..1cc88716c1 100644
> >> --- a/hw/ppc/spapr_hcall.c
> >> +++ b/hw/ppc/spapr_hcall.c
> >> @@ -11,6 +11,7 @@
> >>  #include "helper_regs.h"
> >>  #include "hw/ppc/spapr.h"
> >>  #include "hw/ppc/spapr_cpu_core.h"
> >> +#include "hw/ppc/spapr_numa.h"
> >>  #include "mmu-hash64.h"
> >>  #include "cpu-models.h"
> >>  #include "trace.h"
> >> @@ -1197,6 +1198,9 @@ target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> >>      spapr->cas_pre_isa3_guest = !spapr_ovec_test(ov1_guest, OV1_PPC_3_00);
> >>      spapr_ovec_cleanup(ov1_guest);
> >>  
> >> +    /* Init numa_assoc_array */
> >> +    spapr_numa_associativity_init(spapr, MACHINE(spapr));
> >> +
> >>      /*
> >>       * Ensure the guest asks for an interrupt mode we support;
> >>       * otherwise terminate the boot.
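
Note that this placement relies on spapr->ov5_cas having already been
updated earlier in do_client_architecture_support(), i.e. IIUC the
flow at CAS is now roughly:

    spapr->ov5_cas = spapr_ovec_intersect(ov5_guest, spapr->ov5);
    ...
    spapr_numa_associativity_init(spapr, MACHINE(spapr));

so that the OV5_FORM2_AFFINITY test in the init code sees what the
guest actually negotiated. Probably worth a comment in the code.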
> >> 
> >> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> >> index 04a86f9b5b..0a5fa8101e 100644
> >> --- a/hw/ppc/spapr_numa.c
> >> +++ b/hw/ppc/spapr_numa.c
> >> @@ -202,6 +202,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
> >>          spapr->numa_assoc_array[i][0] = cpu_to_be32(MAX_DISTANCE_REF_POINTS);
> >>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
> >>  
> >> +        /*
> >> +         * For FORM2 affinity the initialization above is enough. No
> >> +         * need to fill non-zero NUMA nodes with node_id, because
> >> +         * there is no associativity domain match to calculate
> >> +         * NUMA distances in FORM2.
> >> +         */
> >> +        if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> >> +            continue;
> >> +        }
> >> +
> >>          /*
> >>           * Fill all associativity domains of non-zero NUMA nodes with
> >>           * node_id. This is required because the default value (0) is
> >> @@ -236,7 +246,16 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
> >>          spapr->numa_assoc_array[i][MAX_DISTANCE_REF_POINTS] = cpu_to_be32(i);
> >>      }
> >>  
> >> -    spapr_numa_FORM1_affinity_init(spapr, machine);
> >> +    /*
> >> +     * We test for !FORM2 instead of testing for FORM1 because,
> >> +     * as per spapr_ov5_cas_needed, setting FORM1 is not enough
> >> +     * to get ov5_cas migrated, but setting FORM2 is. Since we're
> >> +     * dealing with either FORM1 or FORM2, test for the option
> >> +     * that is guaranteed to be set after a migration.
> >> +     */
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> >> +        spapr_numa_FORM1_affinity_init(spapr, machine);
> >> +    }
> >>  }
> >>  
> >>  void spapr_numa_write_associativity_dt(SpaprMachineState *spapr, void *fdt,
> >> @@ -313,6 +332,107 @@ int spapr_numa_write_assoc_lookup_arrays(SpaprMachineState *spapr, void *fdt,
> >>      return ret;
> >>  }
> >>  
> >> +static void spapr_numa_FORM2_write_rtas_tables(SpaprMachineState *spapr,
> >> +                                               void *fdt, int rtas)
> >> +{
> >> +    MachineState *ms = MACHINE(spapr);
> >> +    NodeInfo *numa_info = ms->numa_state->nodes;
> >> +    int nb_numa_nodes = ms->numa_state->num_nodes;
> >> +    int distance_table_entries = nb_numa_nodes * nb_numa_nodes;
> >> +    g_autofree uint32_t *lookup_index_table = NULL;
> >> +    g_autofree uint32_t *distance_table = NULL;
> >> +    int src, dst, i, distance_table_size;
> >> +    uint8_t *node_distances;
> >> +
> >> +    /*
> >> +     * ibm,numa-lookup-index-table: array with a length word and a
> >> +     * list of the NUMA ids present in the guest.
> >> +     */
> >> +    lookup_index_table = g_new0(uint32_t, nb_numa_nodes + 1);
> >> +    lookup_index_table[0] = cpu_to_be32(nb_numa_nodes);
> >> +
> >> +    for (i = 0; i < nb_numa_nodes; i++) {
> >> +        lookup_index_table[i + 1] = cpu_to_be32(i);
> >> +    }
> >> +
> >> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-lookup-index-table",
> >> +                     lookup_index_table,
> >> +                     (nb_numa_nodes + 1) * sizeof(uint32_t)));
> >> +
> >> +    /*
> >> +     * ibm,numa-distance-table: contains all node distances. The
> >> +     * first element is the size of the table as a uint32, followed
> >> +     * by all the uint8 distances from the first NUMA node, then all
> >> +     * distances from the second NUMA node, and so on.
> >> +     *
> >> +     * ibm,numa-lookup-index-table is used by the guest to navigate
> >> +     * this array, because NUMA ids can be sparse (node 0 is the
> >> +     * first, node 8 is the second ...).
> >> +     */
> >> +    distance_table = g_new0(uint32_t, distance_table_entries + 1);
> >> +    distance_table[0] = cpu_to_be32(distance_table_entries);
> >> +
> >> +    node_distances = (uint8_t *)&distance_table[1];
> >> +    i = 0;
> >> +
> >> +    for (src = 0; src < nb_numa_nodes; src++) {
> >> +        for (dst = 0; dst < nb_numa_nodes; dst++) {
> >> +            node_distances[i++] = numa_info[src].distance[dst];
> > 
> > It looks like you could s/i++/src * nb_numa_nodes + dst/ here.
> > 
> >> +        }
> >> +    }
> >> +
> >> +    distance_table_size = distance_table_entries * sizeof(uint8_t) +
> >> +                          sizeof(uint32_t);
> >> +    _FDT(fdt_setprop(fdt, rtas, "ibm,numa-distance-table",
> >> +                     distance_table, distance_table_size));
> >> +}
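
FWIW I did the exercise of computing these two properties for a simple
2-node guest with the default distances (10 local, 20 remote) and no
NVGPUs. If I'm reading the encoding right, they come out as:

    ibm,numa-lookup-index-table = <0x2 0x0 0x1>;
    /* 32-bit length word followed by one byte per distance */
    ibm,numa-distance-table = [00 00 00 04 0a 14 14 0a];

i.e. 4 entries: 10, 20, 20, 10. Might be worth having a test check
this kind of output.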
> >> +
> >> +/*
> >> + * This helper could be merged into a single function with the
> >> + * FORM1 logic, since we're setting the same DT values, the
> >> + * difference being a call to spapr_numa_FORM2_write_rtas_tables()
> >> + * at the end. The separation was made to avoid clogging the FORM1
> >> + * code, which already has to deal with compat modes from previous
> >> + * QEMU machine types.
> >> + */
> >> +static void spapr_numa_FORM2_write_rtas_dt(SpaprMachineState *spapr,
> >> +                                           void *fdt, int rtas)
> >> +{
> >> +    MachineState *ms = MACHINE(spapr);
> >> +    uint32_t number_nvgpus_nodes = spapr->gpu_numa_id -
> >> +                                   spapr_numa_initial_nvgpu_numa_id(ms);
> >> +
> >> +    /*
> >> +     * In FORM2, ibm,associativity-reference-points will point to
> >> +     * the element in the ibm,associativity array that contains the
> >> +     * primary domain index. This value (in our case, the numa-id) is
> >> +     * then used as an index to retrieve all other attributes of the
> >> +     * node (distance, bandwidth, latency) via
> >> +     * ibm,numa-lookup-index-table and other ibm,numa-*-table
> >> +     * properties.
> >> +     */
> >> +    uint32_t refpoints[] = {
> >> +        cpu_to_be32(0x4),
> >> +    };
> >> +
> >> +    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
> >> +    uint32_t maxdomain = ms->numa_state->num_nodes + number_nvgpus_nodes;
> >> +    uint32_t maxdomains[] = {
> >> +        cpu_to_be32(4),
> > 
> > Maybe use the same base for the initializers, especially when they
> > land in the same hunk :) Either dec or hex, I don't care.
> > 
> > Rest LGTM.
> > 
> >> +        cpu_to_be32(maxdomain),
> >> +        cpu_to_be32(maxdomain),
> >> +        cpu_to_be32(maxdomain),
> >> +        cpu_to_be32(maxdomain)
> >> +    };
> >> +
> >> +    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
> >> +                     refpoints, nr_refpoints * sizeof(refpoints[0])));
> >> +
> >> +    _FDT(fdt_setprop(fdt, rtas, "ibm,max-associativity-domains",
> >> +                     maxdomains, sizeof(maxdomains)));
> >> +
> >> +    spapr_numa_FORM2_write_rtas_tables(spapr, fdt, rtas);
> >> +}
> >> +
> >>  static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
> >>                                             void *fdt, int rtas)
> >>  {
> >> @@ -379,6 +499,11 @@ static void spapr_numa_FORM1_write_rtas_dt(SpaprMachineState *spapr,
> >>   */
> >>  void spapr_numa_write_rtas_dt(SpaprMachineState *spapr, void *fdt, int rtas)
> >>  {
> >> +    if (spapr_ovec_test(spapr->ov5_cas, OV5_FORM2_AFFINITY)) {
> >> +        spapr_numa_FORM2_write_rtas_dt(spapr, fdt, rtas);
> >> +        return;
> >> +    }
> >> +
> >>      spapr_numa_FORM1_write_rtas_dt(spapr, fdt, rtas);
> >>  }
> >>  
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index 637652ad16..21b1cf9ebf 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -145,6 +145,7 @@ struct SpaprMachineClass {
> >>      hwaddr rma_limit;          /* clamp the RMA to this size */
> >>      bool pre_5_1_assoc_refpoints;
> >>      bool pre_5_2_numa_associativity;
> >> +    bool pre_6_2_numa_affinity;
> >>  
> >>      bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
> >>                            uint64_t *buid, hwaddr *pio,
> >> diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
> >> index 48b716a060..c3e8b98e7e 100644
> >> --- a/include/hw/ppc/spapr_ovec.h
> >> +++ b/include/hw/ppc/spapr_ovec.h
> >> @@ -49,6 +49,7 @@ typedef struct SpaprOptionVector SpaprOptionVector;
> >>  /* option vector 5 */
> >>  #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
> >>  #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
> >> +#define OV5_FORM2_AFFINITY      OV_BIT(5, 2)
> >>  #define OV5_HP_EVT              OV_BIT(6, 5)
> >>  #define OV5_HPT_RESIZE          OV_BIT(6, 7)
> >>  #define OV5_DRMEM_V2            OV_BIT(22, 0)
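
And for completeness, with the same 2-node / no-NVGPU example as
above, the remaining FORM2 output should be:

    ibm,associativity-reference-points = <0x4>;
    ibm,max-associativity-domains = <0x4 0x2 0x2 0x2 0x2>;

unless I got the NVGPU node accounting wrong.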