On Wed, Nov 12, 2025 at 11:02:47AM +0800, Guo, Wangyang wrote:
> On 11/11/2025 8:08 PM, Ming Lei wrote:
> > On Tue, Nov 11, 2025 at 01:31:04PM +0800, Guo, Wangyang wrote:
> > > On 11/11/2025 11:25 AM, Ming Lei wrote:
> > > > On Tue, Nov 11, 2025 at 10:06:08AM +0800, Wangyang Guo wrote:
> > > > > As CPU core counts increase, the number of NVMe IRQs may be smaller than
> > > > > the total number of CPUs. This forces multiple CPUs to share the same
> > > > > IRQ. If the IRQ affinity and the CPU’s cluster do not align, a
> > > > > performance penalty can be observed on some platforms.
> > > >
> > > > Can you add details why/how CPU cluster isn't aligned with IRQ
> > > > affinity? And how performance penalty is caused?
> > >
> > > Intel Xeon E platform packs 4 CPU cores as 1 module (cluster) and share the
> > > L2 cache. Let's say, if there are 40 CPUs in 1 NUMA domain and 11 IRQs to
> > > dispatch. The existing algorithm will map first 7 IRQs each with 4 CPUs and
> > > remained 4 IRQs each with 3 CPUs each. The last 4 IRQs may have cross
> > > cluster issue. For example, the 9th IRQ which pinned to CPU32, then for
> > > CPU31, it will have cross L2 memory access.
> >
> >
> > CPUs sharing L2 usually have small number, and it is common to see one queue
> > mapping includes CPUs from different L2.
> >
> > So how much does crossing L2 hurt IO perf?
> We see 15%+ performance difference in FIO libaio/randread/bs=8k.
As I mentioned, it is common to see CPUs crossing L2 in same group, but why
does it make a difference here? You mentioned just some platforms are
affected.

> > They still should share same L3 cache, and cpus_share_cache() should be
> > true when the IO completes on the CPU which belong to different L2 with the
> > submission CPU, and remote completion via IPI won't be triggered.
> Yes, remote IPI not triggered.

OK, in my test on AMD zen4, NVMe performance can be dropped to 1/2 - 1/3 if
remote IPI is triggered in case of crossing L3, which is understandable.

I will check if topo cluster can cover L3, if yes, the patch still can be
simplified a lot by introducing sub-node spread by changing
build_node_to_cpumask() and adding nr_sub_nodes.

Thanks,
Ming
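
P.S. For reference, the 40-CPU / 11-IRQ example quoted above can be checked with a rough sketch. This is only a simplified model and not the kernel's actual grouping code: it assumes CPUs are numbered contiguously, that each cluster is 4 consecutive CPU IDs, and that groups are handed out in order with the larger ones first. The helper names spread() and crossing_groups() are made up for illustration.

```python
def spread(num_cpus, num_irqs):
    """Split CPUs 0..num_cpus-1 into num_irqs contiguous groups.

    Simplified model (assumption): the remainder CPUs go to the first
    groups, so larger groups come first.
    """
    base, extra = divmod(num_cpus, num_irqs)
    groups, start = [], 0
    for irq in range(num_irqs):
        size = base + (1 if irq < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


def crossing_groups(groups, cluster_size):
    """Return 1-based IRQ indices whose CPUs span more than one cluster."""
    return [
        idx + 1
        for idx, cpus in enumerate(groups)
        if len({cpu // cluster_size for cpu in cpus}) > 1
    ]


groups = spread(40, 11)
sizes = [len(g) for g in groups]
crossing = crossing_groups(groups, cluster_size=4)

print(sizes)            # [4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3]
print(list(groups[8]))  # 9th IRQ: [31, 32, 33]
print(crossing)         # [9, 10]
```

Under these assumptions the 9th IRQ covers CPU31-33, so CPU31 sits in a different 4-CPU cluster from CPU32 and CPU33, which matches the case described in the quoted text. The 10th IRQ (CPU34-36) is split the same way, while the 8th and 11th happen to stay inside one cluster.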
