Hey, one more question about the tuning project already known to some of
you. We are seeing excellent performance; the findings on how to achieve
that will be published soon. There is one more little thing that keeps me
up at night though ;)

2x E5-2697 v3, 2x X710 now, one per NUMA node.

I use the isolcpus kernel command-line switch; it effectively removes cores
1 and up from the scheduler. IRQ affinity is also used to pin the
processing of each card's data.

Core 0 - housekeeping
Core 1 - hardware + software IRQ + kernel side of af_packet processing
Cores 2..N - my workload
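
The pinning itself is nothing exotic. A sketch, assuming the i40e queue
interrupts show up as eth2-TxRx-* in /proc/interrupts (the interface name,
core range, and grep pattern are placeholders for this box):

```shell
# Kernel command line (via grub): keep cores 1 and up away from the scheduler
#   isolcpus=1-27
#
# Stop irqbalance first (as root), or it will undo the manual affinity:
#   systemctl stop irqbalance

# Pin every X710 queue interrupt to core 1
for irq in $(awk -F: '/eth2-TxRx/ {print $1}' /proc/interrupts); do
    echo 1 > "/proc/irq/$irq/smp_affinity_list"
done
```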

When I measured L3 cache hits with

perf stat -C 1 -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches <command>
(and -C 2, and so on)

I got excellent results, something like 0.3% misses, on all cores. DCA
really works after lowering the ring size to 512 with ethtool.
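
For reference, the ring change is a one-liner (eth2 is a placeholder; the
idea is that a smaller ring keeps the set of in-flight buffers small enough
for DCA to keep warm in L3):

```shell
# Shrink the RX descriptor ring from the driver default down to 512
ethtool -G eth2 rx 512
# Verify the new size
ethtool -g eth2
```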

Then I started using RPS to move part of the kernel workload to a separate
core, after I saw Core 1 being frequently pegged:

Core 0 - housekeeping
Core 1 - hardware + parts of software IRQ, RPS starts
Core 2 - rest of software IRQ, kernel side of af_packet processing
Core 3+ - my workload
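
For completeness, RPS is enabled per RX queue by writing a hex CPU bitmask
to sysfs; a sketch, assuming eth2 and steering to Core 2 only (the sysfs
writes need root):

```shell
# RPS CPU mask has one bit per CPU: Core 2 -> bit 2 -> 0x4
mask=$(printf '%x' $((1 << 2)))
echo "$mask"   # prints 4

# Apply the mask to every RX queue of the interface
for q in /sys/class/net/eth2/queues/rx-*; do
    if [ -w "$q/rps_cpus" ]; then
        echo "$mask" > "$q/rps_cpus"
    fi
done
```

A mask covering cores 2 and 3 together would be 0xc, and so on.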

L3 cache misses on cores 2, 3 and up are still very low, around 0.4%, but
L3 misses on Core 1 suddenly jumped to at least 8%.

It kind of ruins my understanding of how DCA is supposed to work here:
shouldn't it be putting the data in the L3 cache? Or am I misinterpreting
the results, or measuring them wrong?
E1000-devel mailing list