On Wed, May 27, 2015 at 03:04:28PM -0400, r...@redhat.com wrote:
> From: Rik van Riel <r...@redhat.com>
>
> Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks the
> preferred node") fixes an issue where workloads would never converge
> on a fully loaded (or overloaded) system.
>
> However, it introduces a regression on less than fully loaded systems,
> where workloads converge on a few NUMA nodes, instead of properly staying
> spread out across the whole system. This leads to a reduction in available
> memory bandwidth, and usable CPU cache, with predictable performance problems.
>
> The root cause appears to be an interaction between the load balancer and
> NUMA balancing, where the short term load represented by the load balancer
> differs from the long term load the NUMA balancing code would like to base
> its decisions on.
>
> Simply reverting a43455a1 would re-introduce the non-convergence of
> workloads on fully loaded systems, so that is not a good option. As
> an aside, the check done before a43455a1 only applied to a task's
> preferred node, not to other candidate nodes in the system, so the
> converge-on-too-few-nodes problem still happens, just to a lesser
> degree.
>
> Instead, try to compensate for the impedance mismatch between the
> load balancer and NUMA balancing by only ever considering a lesser
> loaded node as a destination for NUMA balancing, regardless of
> whether the task is trying to move to the preferred node, or to another
> node.
>
> This patch also addresses the issue that a system with a single runnable
> thread would never migrate that thread to near its memory, introduced by
> 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced").
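
For readers following along, the destination rule described in the changelog above (a node is only considered if it is less loaded than the source, whether or not it is the preferred node) could be sketched roughly as below. This is an illustration only, not the code from the patch: struct node_stats, node_is_less_loaded() and pick_destination() are hypothetical names, the load and capacity fields are assumed stand-ins for whatever metrics the scheduler uses, and the fallback of choosing the least loaded eligible node is an assumption made to keep the example small. The real NUMA balancing code also weighs memory access statistics, which are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-node summary; not a kernel structure. */
struct node_stats {
	uint64_t load;     /* assumed long-term load figure for the node */
	uint64_t capacity; /* assumed compute capacity, must be non-zero */
};

/*
 * True if dst is strictly less loaded than src relative to capacity.
 * load/capacity ratios are compared by cross-multiplication so the
 * check stays in integer arithmetic (values assumed small enough
 * not to overflow 64 bits).
 */
static bool node_is_less_loaded(const struct node_stats *src,
				const struct node_stats *dst)
{
	return dst->load * src->capacity < src->load * dst->capacity;
}

/*
 * Pick a destination node for a task currently on src_nid.
 * The same "less loaded than the source" test is applied to the
 * preferred node and to every other candidate. Returns the chosen
 * node id, or -1 if no node qualifies and the task should stay put.
 */
static int pick_destination(const struct node_stats *nodes, int nr_nodes,
			    int src_nid, int preferred_nid)
{
	int best = -1;

	/* The preferred node wins, but only if it passes the load test. */
	if (preferred_nid >= 0 && preferred_nid < nr_nodes &&
	    preferred_nid != src_nid &&
	    node_is_less_loaded(&nodes[src_nid], &nodes[preferred_nid]))
		return preferred_nid;

	/* Otherwise fall back to the least loaded eligible node (assumption). */
	for (int nid = 0; nid < nr_nodes; nid++) {
		if (nid == src_nid)
			continue;
		if (!node_is_less_loaded(&nodes[src_nid], &nodes[nid]))
			continue;
		if (best < 0 || node_is_less_loaded(&nodes[best], &nodes[nid]))
			best = nid;
	}
	return best;
}
```

With three equal-capacity nodes at loads 800, 900 and 200, a task on the first node whose preferred node is the second would be steered to the third, since the preferred node is busier than the source.
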
>
> A test where the main thread creates a large memory area, and spawns
> a worker thread to iterate over the memory (placed on another node
> by select_task_rq_fair), after which the main thread goes to sleep
> and waits for the worker thread to loop over all the memory now sees
> the worker thread migrated to where the memory is, instead of having
> all the memory migrated over like before.
>
> Jirka has run a number of performance tests on several systems:
> single instance SpecJBB 2005 performance is 7-15% higher on a 4 node
> system, with higher gains on systems with more cores per socket.
> Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
> little or no changes with the revert of 095bebf61a46 and this patch.
>
> Signed-off-by: Rik van Riel <r...@redhat.com>
> Reported-by: Artem Bityutski <dedeki...@gmail.com>
> Reported-by: Jirka Hladky <jhla...@redhat.com>
> Tested-by: Jirka Hladky <jhla...@redhat.com>
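
The single-thread test described in the changelog above could look something like the following userspace sketch. It is not the actual test program, which was not posted here: struct work, worker() and run_workload() are hypothetical names, and the buffer size and pass count are left to the caller. The sketch only reproduces the shape of the workload (main thread touches the memory first, worker iterates over it, main thread blocks). It does not control which node the worker starts on and does not check where the thread or the pages end up.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical work descriptor shared between main thread and worker. */
struct work {
	unsigned char *mem; /* memory area created by the main thread */
	size_t len;         /* size of the area in bytes */
	int passes;         /* how many times the worker walks the area */
	uint64_t sum;       /* result, so the loop has a visible effect */
};

/* Worker: iterate over all of the memory, 'passes' times. */
static void *worker(void *arg)
{
	struct work *w = arg;

	for (int p = 0; p < w->passes; p++)
		for (size_t i = 0; i < w->len; i++)
			w->sum += w->mem[i];
	return NULL;
}

/*
 * Main-thread side of the test. Returns 0 on success, -1 on failure,
 * and stores the worker's byte sum in *sum_out.
 */
static int run_workload(size_t len, int passes, uint64_t *sum_out)
{
	struct work w = { .len = len, .passes = passes, .sum = 0 };
	pthread_t tid;

	w.mem = malloc(len);
	if (!w.mem)
		return -1;

	/*
	 * Touch every page from the main thread. Under the default
	 * local-allocation policy this normally places the pages on
	 * the node the main thread is running on.
	 */
	memset(w.mem, 1, len);

	if (pthread_create(&tid, NULL, worker, &w) != 0) {
		free(w.mem);
		return -1;
	}

	/* The main thread blocks here until the worker has finished. */
	pthread_join(tid, NULL);

	*sum_out = w.sum;
	free(w.mem);
	return 0;
}
```

A run with a buffer much larger than the CPU caches and enough passes to span several NUMA scan periods would be needed for the balancing behaviour to show up at all; the sizes are deliberately not fixed here.
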

Acked-by: Mel Gorman <mgor...@suse.de>

-- 
Mel Gorman
SUSE Labs