Hi, OK, it is now very clear that the problem is due to memory transactions between different NUMA nodes.
The test program is here: https://gist.github.com/jigsawecho/6a2e78d65f0fe67adf1b

The test machine topology is:

NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

Change the 3rd param from 0 to 1 at line 135, and the LLC load miss rate
jumps from 0.09% to 33.45%, while the LLC store miss rate jumps from
0.027% to 50.695%.

Clearly the root cause is transactions crossing the node boundary. But how
to resolve this problem is another topic... (a minimal stand-alone
reproduction sketch follows at the bottom of this mail).

thx &
rgds,
-ql

On Tue, Nov 11, 2014 at 5:37 PM, jigsaw <jigsaw at gmail.com> wrote:

> Hi Bruce,
>
> I noticed that librte_distributor has a quite severe LLC miss problem when
> running on 16 cores, while on 8 cores there is no such problem.
> The test runs on an Intel(R) Xeon(R) CPU E5-2670, a Sandy Bridge with 32
> cores on 2 sockets.
>
> The test case is the distributor_perf_autotest, i.e.
> app/test/test_distributor_perf.c.
> The test result is collected by the command:
>
> perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores ./test
> -cff -n2 --no-huge
>
> Note that the results show the LLC miss rate remains the same with or
> without hugepages, so I will just show the --no-huge config.
>
> With 8 cores, the LLC miss rate is OK:
>
> LLC-load-misses      26750
> LLC-loads            93979233
> LLC-store-misses     432263
> LLC-stores           69954746
>
> That is 0.028% load misses and 0.62% store misses.
>
> With 16 cores, the LLC miss rate is very high:
>
> LLC-load-misses      70263520
> LLC-loads            143807657
> LLC-store-misses     23115990
> LLC-stores           63692854
>
> That is 48.9% load misses and 36.3% store misses.
>
> Most of the load misses happen at the first line of rte_distributor_poll_pkt.
> Where most of the store misses happen I don't know, because perf record on
> LLC-store-misses brings down my machine.
>
> It is not obvious to me how this can happen: 8 cores are fine, but
> 16 cores are very bad.
> My guess is that 16 cores bring in more QPI transactions between sockets,
> or that 16 cores produce a different LLC access pattern.
>
> So I tried to reduce the padding inside union rte_distributor_buffer from
> 3 cachelines to 1 cacheline:
>
> - char pad[CACHE_LINE_SIZE*3];
> + char pad[CACHE_LINE_SIZE];
>
> And it does have an obvious effect:
>
> LLC-load-misses      53159968
> LLC-loads            167756282
> LLC-store-misses     29012799
> LLC-stores           63352541
>
> Now it is 31.69% load misses and 45.79% store misses.
>
> It lowers the load miss rate but raises the store miss rate.
> Both numbers are still very high, sadly.
> The bright side is that it decreases the time per burst and the time per
> packet.
>
> The original version has:
> === Performance test of distributor ===
> Time per burst:  8013
> Time per packet: 250
>
> And the patched version has:
> === Performance test of distributor ===
> Time per burst:  6834
> Time per packet: 213
>
> I tried a couple of other tricks, such as adding more idle loops
> in rte_distributor_get_pkt, and making the rte_distributor_buffer
> thread-local to each worker core. But none of these tricks had any
> noticeable effect. These failures make me tend to believe that the high
> LLC miss rate is related to QPI or NUMA, but my machine cannot perf on
> uncore QPI events, so this cannot be proven.
>
> I cannot draw any conclusion or reveal the root cause after all, but I
> suggest a further study of this performance bottleneck so as to find a
> good solution.
>
> thx &
> rgds,
> -qinglai
>
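
PS. For anyone who wants to reproduce the cross-node effect outside of DPDK,
below is a minimal sketch of the kind of experiment, using libnuma. This is
not the gist above; the node numbers, buffer size and iteration count are
only illustrative.

/*
 * Minimal sketch (not the gist above) of the cross-node experiment,
 * using libnuma.  Node numbers, buffer size and iteration count are
 * only illustrative.
 * Build:  gcc -O2 -o numa_touch numa_touch.c -lnuma
 * Run:    perf stat -e LLC-load-misses,LLC-loads,LLC-store-misses,LLC-stores \
 *             ./numa_touch 1
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>

#define BUF_SIZE (64UL * 1024 * 1024)   /* 64 MB working set */
#define ITERS    100

int main(int argc, char **argv)
{
    int cpu_node = 0;                              /* node the code runs on */
    int mem_node = (argc > 1) ? atoi(argv[1]) : 0; /* node the buffer lives on */
    volatile char *buf;
    unsigned long i, iter, sum = 0;

    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    /* pin execution to cpu_node and allocate the buffer on mem_node */
    numa_run_on_node(cpu_node);
    buf = numa_alloc_onnode(BUF_SIZE, mem_node);
    if (buf == NULL) {
        fprintf(stderr, "allocation on node %d failed\n", mem_node);
        return 1;
    }
    memset((void *)buf, 0, BUF_SIZE);   /* fault the pages in on mem_node */

    /*
     * Read and write one byte per cacheline.  With mem_node != cpu_node
     * every LLC miss becomes a QPI transaction to the remote socket.
     */
    for (iter = 0; iter < ITERS; iter++) {
        for (i = 0; i < BUF_SIZE; i += 64) {
            sum += buf[i];
            buf[i] = (char)iter;
        }
    }

    printf("done, sum=%lu (mem node %d, cpu node %d)\n", sum, mem_node, cpu_node);
    numa_free((void *)buf, BUF_SIZE);
    return 0;
}

A quicker sanity check on the distributor test itself might be to run it
under numactl --membind=0 (with --no-huge) and compare the perf numbers,
so that at least all memory stays on one socket regardless of the coremask.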