Hi Srikar,

Srikar Dronamraju <sri...@linux.vnet.ibm.com> writes:

> Currently the kernel detects if its running on a shared lpar platform
> and requests home node associativity before the scheduler sched_domains
> are setup. However between the time NUMA setup is initialized and the
> request for home node associativity, workqueue initializes its per node
> cpumask. The per node workqueue possible cpumask may turn invalid
> after home node associativity resulting in weird situations like
> workqueue possible cpumask being a subset of workqueue online cpumask.
>
> This can be fixed by requesting home node associativity earlier just
> before NUMA setup. However at the NUMA setup time, kernel may not be in
> a position to detect if its running on a shared lpar platform. So
> request for home node associativity and if the request fails, fallback
> on the device tree property.
>
> While here, fix a problem where of_node_put could be called even when
> of_get_cpu_node was not successful.
of_node_put() handles NULL arguments, so this should not be necessary.

> +static int vphn_get_nid(unsigned long cpu, bool get_hwid)

[...]

> +static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)

[...]

> @@ -528,7 +561,7 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
>  {
>  	int nid;
>
> -	nid = numa_setup_cpu(cpu);
> +	nid = numa_setup_cpu(cpu, true);
>  	verify_cpu_node_mapping(cpu, nid);
>  	return 0;
>  }

> @@ -875,7 +908,7 @@ void __init mem_topology_setup(void)
>  	reset_numa_cpu_lookup_table();
>
>  	for_each_present_cpu(cpu)
> -		numa_setup_cpu(cpu);
> +		numa_setup_cpu(cpu, false);
>  }

I'm open to other points of view here, but I would prefer two separate
functions, something like vphn_get_nid() for runtime and
vphn_get_nid_early() (which could be __init) for boot-time
initialization. Propagating a somewhat unexpressive boolean flag
through two levels of function calls in this code is unappealing...

Regardless, I have an annoying question :-)

Isn't it possible that, while Linux is calling vphn_get_nid() for each
logical cpu in sequence, the platform could change a virtual
processor's node assignment, potentially causing sibling threads to
get different node assignments and producing an incoherent topology
(which then leads to sched domain assertions etc.)? If so, I think
more care is needed. The algorithm should make the VPHN call only once
per cpu node, I think?