Hi Tejun,

[I found the other thread where you made these points, thank you for expressing them so clearly again!]
On Fri, Jul 18, 2014 at 11:00 AM, Tejun Heo <t...@kernel.org> wrote:
>
> Hello,
>
> On Fri, Jul 18, 2014 at 10:42:29AM -0700, Nish Aravamudan wrote:
> > So, to be clear, this is not *necessarily* about memoryless nodes. It's
> > about the semantics intended. The workqueue code currently calls
> > cpu_to_node() in a few places, and passes that node into the core MM as a
> > hint about where the memory should come from. However, when memoryless
> > nodes are present, that hint is guaranteed to be wrong, as it's the nearest
> > NUMA node to the CPU (which happens to be the one it's on), not the nearest
> > NUMA node with memory. The hint is correctly specified as cpu_to_mem(),
>
> It's telling the allocator the node the CPU is on. Choosing and
> falling back the actual allocation is the allocator's job.

Ok, I agree with you then, if that's all the hint's semantics are supposed
to be. But looking at the comment for kthread_create_on_node():

 * If thread is going to be bound on a particular cpu, give its node
 * in @node, to get NUMA affinity for kthread stack, or else give -1.

the API interprets @node as a suggestion for the affinity itself, *not*
the node the kthread should be on. Piddly, yes, but actually I have
another thought altogether, and in reviewing Jiang's patches this seems
like the right approach: why aren't these callers using
kthread_create_on_cpu()? That API was already changed to use cpu_to_mem()
[so one change, rather than changes all over the kernel source]. We could
change it back to cpu_to_node() and push down the knowledge about the
fallback.

> > which does the right thing in the presence or absence of memoryless nodes.
> > And I think encapsulates the hint's semantics correctly -- please give me
> > memory from where I expect it, which is the closest NUMA node.
>
> I don't think it does. It loses information at too high a layer.
> Workqueue here doesn't care how memory subsystem is structured, it's
> just telling the allocator where it's at and expecting it to do the
> right thing. Please consider the following scenario.
>
> A - B - C - D - E
>
> Let's say C is a memory-less node. If we map from C to either B or D
> from individual users and that node can't serve that memory request,
> the allocator would fall back to A or E respectively when the right
> thing to do would be falling back to D or B respectively, right?

Yes, this is a good point. But honestly, we're not even at the point of
talking about fallback here; at least in my testing, going off-node at
all causes SLUB-configured slabs to deactivate, which then leads to an
explosion in the unreclaimable slab.

> This isn't a huge issue but it shows that this is the wrong layer to
> deal with this issue. Let the allocators express where they are.
> Choosing and falling back belong to the memory allocator. That's the
> only place which has all the information that's necessary and those
> details must be contained there. Please don't leak it to memory
> allocator users.

Ok, I will continue to work at that level of abstraction.

Thanks,
Nish
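
P.S. To make the hint distinction concrete, here is a rough sketch of the
two choices a caller has. This is illustrative only -- spawn_worker_for_cpu()
and worker_fn() are made-up names, not the actual workqueue code:

#include <linux/kthread.h>
#include <linux/topology.h>

static int worker_fn(void *data)
{
	/* per-CPU work would go here */
	return 0;
}

static struct task_struct *spawn_worker_for_cpu(unsigned int cpu)
{
	/*
	 * Hint (a): the node this CPU sits on.  On a memoryless node this
	 * names a node with no memory of its own, and the allocator is
	 * expected to choose the actual fallback node itself.
	 */
	int node = cpu_to_node(cpu);

	/*
	 * Hint (b): the nearest node that actually has memory, i.e. the
	 * caller resolves the fallback up front:
	 *
	 *	int node = cpu_to_mem(cpu);
	 */

	return kthread_create_on_node(worker_fn, NULL, node,
				      "worker/%u", cpu);
}

Either way the call looks the same; the question is purely which side of
the API owns the memoryless-node fallback.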