---------- Forwarded message ---------
From: Yury Norov <[email protected]>
Date: Sat, Nov 12, 2022 at 11:14 AM
Subject: [PATCH v2 0/4] cpumask: improve on cpumask_local_spread() locality
To: <[email protected]>, David S. Miller <[email protected]>, Andy Shevchenko <[email protected]>, Barry Song <[email protected]>, Ben Segall <[email protected]>, Daniel Bristot de Oliveira <[email protected]>, Dietmar Eggemann <[email protected]>, Gal Pressman <[email protected]>, Greg Kroah-Hartman <[email protected]>, Heiko Carstens <[email protected]>, Ingo Molnar <[email protected]>, Jakub Kicinski <[email protected]>, Jason Gunthorpe <[email protected]>, Jesse Brandeburg <[email protected]>, Jonathan Cameron <[email protected]>, Juri Lelli <[email protected]>, Leon Romanovsky <[email protected]>, Mel Gorman <[email protected]>, Peter Zijlstra <[email protected]>, Rasmus Villemoes <[email protected]>, Saeed Mahameed <[email protected]>, Steven Rostedt <[email protected]>, Tariq Toukan <[email protected]>, Tariq Toukan <[email protected]>, Tony Luck <[email protected]>, Valentin Schneider <[email protected]>, Vincent Guittot <[email protected]>
Cc: Yury Norov <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>

cpumask_local_spread() currently checks the local node for the presence of the i'th CPU and, if it finds nothing, falls back to a flat search among all non-local CPUs. We can do better by checking CPUs per NUMA hop.

This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity hints":

https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

According to their measurements, for mlx5e:

        Bottleneck in RX side is released, reached linerate (~1.8x speedup).
        ~30% less cpu util on TX.

This patch makes cpumask_local_spread() traverse CPUs based on NUMA distance as well, and I expect a comparable improvement for its users, as in the case of mlx5e.

I tested the new behavior on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

And the cpumask_local_spread() traversal for each node and offset looks like this:

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3

v1: https://lore.kernel.org/lkml/[email protected]/T/
v2:
 - use bsearch() in sched_numa_find_nth_cpu();
 - fix missing 'static inline' in 3rd patch.

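For readers who want to check the traversal orders listed above, here is a minimal userspace sketch in Python. It is not the kernel code from this series: it only models the idea of ordering CPUs by NUMA distance from a given node, using the topology and distance matrix from the numactl -H output. The names NODE_CPUS, DISTANCE, spread_order() and local_spread() are hypothetical, and both the wrap-around of the index and the tie-break by CPU number are assumptions of the sketch.

```python
# Sketch only: models "traverse CPUs by NUMA distance" in userspace.
# All names here are hypothetical; this is not the kernel implementation.

# node -> CPUs, copied from the numactl -H output in the message
NODE_CPUS = {
    0: [0, 1, 2, 3],
    1: [4, 5],
    2: [6, 7],
    3: [8, 9, 10, 11, 12, 13, 14, 15],
}

# NUMA distance matrix, copied from the same output
DISTANCE = [
    [10, 50, 30, 70],
    [50, 10, 70, 30],
    [30, 70, 10, 50],
    [70, 30, 50, 10],
]


def spread_order(node):
    """Return all CPUs ordered by increasing NUMA distance from `node`.

    CPUs at the same distance are ordered by CPU number (an assumption).
    """
    keyed = [
        (DISTANCE[node][other], cpu)
        for other, cpus in NODE_CPUS.items()
        for cpu in cpus
    ]
    return [cpu for _, cpu in sorted(keyed)]


def local_spread(i, node):
    """Hypothetical stand-in for cpumask_local_spread(i, node).

    Wrapping `i` around the CPU count is an assumption of this sketch.
    """
    order = spread_order(node)
    return order[i % len(order)]


if __name__ == "__main__":
    for node in NODE_CPUS:
        print(f"node {node}:", *spread_order(node))
```

With the distance matrix above, spread_order(0) visits node 0's CPUs first, then node 2 (distance 30), node 1 (distance 50) and node 3 (distance 70), which agrees with the "node 0" row in the message.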
Yury Norov (4):
  lib/find: introduce find_nth_and_andnot_bit
  cpumask: introduce cpumask_nth_and_andnot
  sched: add sched_numa_find_nth_cpu()
  cpumask: improve on cpumask_local_spread() locality

 include/linux/cpumask.h  | 20 +++++++++++++++
 include/linux/find.h     | 33 ++++++++++++++++++++++++
 include/linux/topology.h |  8 ++++++
 kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
 lib/cpumask.c            | 12 ++-------
 lib/find_bit.c           |  9 +++++++
 6 files changed, 127 insertions(+), 10 deletions(-)

--
2.34.1

--
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC
_______________________________________________
LibreQoS mailing list
[email protected]
https://lists.bufferbloat.net/listinfo/libreqos
