Hi Srikar,
Srikar Dronamraju <sri...@linux.vnet.ibm.com> writes:
> Currently the kernel detects if its running on a shared lpar platform
> and requests home node associativity before the scheduler sched_domains
> are setup. However between the time NUMA setup is initialized and the
> request for home node associativity, workqueue initializes its per node
> cpumask. The per node workqueue possible cpumask may turn invalid
> after home node associativity resulting in weird situations like
> workqueue possible cpumask being a subset of workqueue online cpumask.
>
> This can be fixed by requesting home node associativity earlier just
> before NUMA setup. However at the NUMA setup time, kernel may not be in
> a position to detect if its running on a shared lpar platform. So
> request for home node associativity and if the request fails, fallback
> on the device tree property.
I think this is generally sound at the conceptual level.
> However home node associativity requires cpu's hwid which is set in
> smp_setup_pacas. Hence call smp_setup_pacas before numa_setup_cpus.
But wouldn't this negatively affect the pacas' NUMA placement, since they
would then be allocated before the NUMA topology is known? Would it be
less risky to figure out a way to do "early" VPHN hcalls before
mem_topology_setup, perhaps getting the hwids from the cpu_to_phys_id
array?
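To make the suggestion concrete, here is a rough userspace sketch of the
lookup order I have in mind: take the hwid from an early array, ask the
hypervisor, and fall back to the device tree if the query fails. This is
not kernel code. Every sketch_* name, the array contents and the stubbed
return values are made up for illustration; the real version would use
cpu_to_phys_id and the VPHN hcall.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS   4
#define SKETCH_NO_NODE (-1)

/* Hypothetical stand-in for an early hwid array indexed by logical CPU. */
static const uint32_t sketch_cpu_to_phys_id[SKETCH_NR_CPUS] = { 8, 12, 16, 20 };

/*
 * Stub for the hypervisor query. It pretends that hwids 8 and 12 have
 * home node 1 and that the call fails for every other hwid.
 */
static int sketch_vphn_query(uint32_t hwid, int *nid)
{
	if (hwid == 8 || hwid == 12) {
		*nid = 1;
		return 0;
	}
	return -1;
}

/* Stub for the device tree fallback: every known CPU maps to node 0. */
static int sketch_device_tree_nid(unsigned int cpu)
{
	return cpu < SKETCH_NR_CPUS ? 0 : SKETCH_NO_NODE;
}

/* Try the hypervisor first; use the device tree only if that fails. */
int sketch_early_cpu_to_nid(unsigned int cpu)
{
	int nid;

	if (cpu >= SKETCH_NR_CPUS)
		return SKETCH_NO_NODE;

	if (sketch_vphn_query(sketch_cpu_to_phys_id[cpu], &nid) == 0) {
		assert(nid >= 0);
		return nid;
	}

	return sketch_device_tree_nid(cpu);
}
```

The point is only that the hwid lookup does not need the pacas, so
smp_setup_pacas could stay where it is.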
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 88b5157..7965d3b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -461,6 +461,21 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
> return nid;
> }
>
> +static int vphn_get_nid(unsigned long cpu)
> +{
> + __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> + long rc;
> +
> + /* Use associativity from first thread for all siblings */
I don't understand how this comment corresponds to the code it
accompanies.
> + rc = hcall_vphn(get_hard_smp_processor_id(cpu),
> + VPHN_FLAG_VCPU, associativity);
> +
> + if (rc == H_SUCCESS)
> + return associativity_to_nid(associativity);
Nit: there is an extra space in the line above.
> @@ -490,7 +505,18 @@ static int numa_setup_cpu(unsigned long lcpu)
> goto out;
> }
>
> - nid = of_node_to_nid_single(cpu);
> + /*
> + * On a shared lpar, the device tree might not have the correct node
> + * associativity. At this time lppaca, or its __old_status field
Sorry, but I'm going to quibble with this phrasing a bit. On SPLPAR the
CPU nodes have no affinity information in the device tree at all. This
comment implies that they may have incorrect information, which is
AFAIK not the case.