On 11/13/2025 9:38 AM, Ming Lei wrote:
On Wed, Nov 12, 2025 at 11:02:47AM +0800, Guo, Wangyang wrote:
On 11/11/2025 8:08 PM, Ming Lei wrote:
On Tue, Nov 11, 2025 at 01:31:04PM +0800, Guo, Wangyang wrote:
On 11/11/2025 11:25 AM, Ming Lei wrote:
On Tue, Nov 11, 2025 at 10:06:08AM +0800, Wangyang Guo wrote:
As CPU core counts increase, the number of NVMe IRQs may be smaller than
the total number of CPUs. This forces multiple CPUs to share the same
IRQ. If the IRQ affinity and the CPU’s cluster do not align, a
performance penalty can be observed on some platforms.
Can you add details on why/how the CPU cluster isn't aligned with the IRQ
affinity? And how is the performance penalty caused?
Intel Xeon E platforms pack 4 CPU cores into 1 module (cluster) which shares
an L2 cache. Say there are 40 CPUs in 1 NUMA domain and 11 IRQs to dispatch.
The existing algorithm maps the first 7 IRQs to 4 CPUs each and the remaining
4 IRQs to 3 CPUs each. Those last 4 IRQs can straddle cluster boundaries.
For example, if the 9th IRQ is pinned to CPU32, then CPU31, which shares that
IRQ but sits in a different cluster, will incur cross-L2 memory accesses.
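To make the arithmetic above concrete, here is a small Python sketch (an assumption of the described split, not the kernel's actual group_cpus_evenly() code) that divides 40 CPUs across 11 IRQ groups and flags groups that straddle a 4-CPU cluster:

```python
# Sketch of the grouping arithmetic described above: split 40 CPUs into
# 11 IRQ groups (7 groups of 4, then 4 groups of 3) and flag groups that
# straddle a 4-CPU cluster boundary.
def irq_groups(ncpus=40, nirqs=11):
    base, extra = divmod(ncpus, nirqs)  # 40 = 11*3 + 7
    groups, start = [], 0
    for i in range(nirqs):
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups

for i, g in enumerate(irq_groups(), 1):
    clusters = {cpu // 4 for cpu in g}  # 4 CPUs per cluster
    mark = "  <- crosses clusters" if len(clusters) > 1 else ""
    print(f"IRQ {i:2d}: CPUs {g}{mark}")
```

Running this shows the 9th IRQ covering CPUs 31-33, with CPU31 in a different cluster than CPU32/33, matching the example above.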
CPUs sharing an L2 are usually few in number, and it is common to see one
queue mapping include CPUs from different L2 domains.
So how much does crossing L2 hurt IO performance?
We see a 15%+ performance difference in FIO with libaio/randread/bs=8k.
As I mentioned, it is common to see CPUs crossing L2 in the same group, but
why does it make a difference here? You mentioned only some platforms are
affected.
We observed the performance difference on the Intel Xeon E platform, which
has 4 physical CPU cores per module (cluster) sharing the same L2 cache.
For other platforms like Intel P-core or AMD, I think it is hard to show a
performance benefit from L2 locality, because:
1. The L2 cache is only shared between 2 logical cores when HT is enabled.
2. If the IRQ is pinned to the corresponding HT sibling, L2 cache locality
is good, but other aspects like retiring may be affected, since the siblings
share the same physical CPU resources.
BR
Wangyang