Hi,

There is a huge performance regression on 2 and 4 NUMA node systems on the stream benchmark with the 4.17 kernel compared to the 4.16 kernel. The Stream, Linpack and NAS parallel benchmarks show up to a 50% performance drop.

When running, for example, 20 stream processes in parallel, we see the following behavior:

* all processes are started at NODE #1
* memory is also allocated on NODE #1
* roughly half of the processes are moved to NODE #0 very quickly
* however, memory is not moved to NODE #0 and stays allocated on NODE #1

As a result, half of the processes are running on NODE #0 with their memory still allocated on NODE #1. This leads to non-local memory accesses and a high Remote-To-Local Memory Access Ratio on the numatop charts.
So it seems that 4.17 is not doing a good job of moving the memory to the right NUMA node after the process has been moved.
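
One way to check where a process's memory actually resides is to sum the per-node page counts (the `N<node>=<pages>` fields) in /proc/<pid>/numa_maps. The following is only a minimal sketch and not part of the original measurements: the helper names and the sample text are hypothetical, and real numa_maps output will differ.

```python
import re
from collections import Counter

# Matches per-node page counts such as "N0=123" in a numa_maps line.
NODE_PAGES = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_text):
    """Sum the pages reported per NUMA node across all mappings."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in NODE_PAGES.findall(line):
            totals[int(node)] += int(pages)
    return dict(totals)


def read_numa_maps(pid):
    """Read /proc/<pid>/numa_maps (Linux only; needs access to that pid)."""
    with open(f"/proc/{pid}/numa_maps") as f:
        return f.read()


# Hypothetical sample for illustration, not taken from the report.
SAMPLE = """\
00400000 default file=/usr/bin/stream mapped=4 N1=4 kernelpagesize_kB=4
7f2c00000000 default anon=30000 dirty=30000 N1=30000 kernelpagesize_kB=4
7ffd1000 default stack anon=3 dirty=3 N0=3 kernelpagesize_kB=4
"""
```

Calling `pages_per_node(read_numa_maps(pid))` for a process that runs on NODE #0 and seeing nearly all of its pages under node 1 would match the behavior described above.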

----8<----

The above is an excerpt from performance testing on 4.16 and 4.17 kernels.

For now I'm merely making sure the problem is reported.

Thank you.

Best regards,
Jakub Racek
