Last time this came up, there was concern about whether we can safely trim the mm cpumask, and what concurrency there is versus use_mm(). I've had another look and still think this is okay, though I haven't thought of a good way to add debug checks to ensure it.
When doing a parallel kernel build on a 2 socket P9 system, this series
causes tlbie (broadcast) flushes to go from 1.37 million (22k/sec) down
to 181 thousand (3k/sec). tlbiel (local) flushes increase from 20.2 to
23.7 million. This is because flushing a PID requires 128 tlbiel
instructions (vs 1 tlbie), and also because when using tlbiel we set the
cutoff for switching from a va-range flush to a full-PID flush higher.

End-result performance was very little changed: perhaps a tiny
improvement, but well under 1%. A kernel compile mostly stays off the
interconnect, this is a small system, and there is no nMMU involvement.
Any of these factors could make the reduction in broadcast tlbie more
important.

Remaining work: ensuring correctness of this stuff, implementations for
hash, understanding and testing nMMU cases better, using IPIs for
some/all types of invalidations, then possibly looking at doing
something more fancy with the PID allocator.

Nicholas Piggin (3):
  powerpc/64s: do not flush TLB when relaxing access
  powerpc/64s/radix: reset mm_cpumask for single thread process when
    possible
  powerpc/64s: always flush non-local CPUs from single threaded mms

 arch/powerpc/include/asm/mmu_context.h | 33 ++++++++++----
 arch/powerpc/include/asm/tlb.h         |  7 +++
 arch/powerpc/mm/pgtable-book3s64.c     |  1 -
 arch/powerpc/mm/pgtable.c              |  3 +-
 arch/powerpc/mm/tlb-radix.c            | 78 +++++++++++++++++++++++++---------
 5 files changed, 92 insertions(+), 30 deletions(-)

-- 
2.16.1