Task migration under NUMA balancing can happen in parallel: more than
one task might choose to migrate to the same CPU at the same time. This
can result in:

- During task swap, choosing a task that was not part of the evaluation.
- During task swap, a task that has just moved to its preferred node
  being moved to a completely different node.
- During task swap, a task failing to move to its preferred node and
  having to wait an extra interval for the next migration opportunity.
- During task movement, multiple concurrent movements causing a load
  imbalance.

This problem is more likely if there are more cores per node or more
nodes in the system. Use a per-run-queue variable to check if NUMA
balancing is active on that run-queue.
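To make the claim/release idea concrete, here is a minimal standalone
sketch of the same pattern using C11 atomics and pthreads in place of
the kernel's xchg()/WRITE_ONCE(); fake_rq, try_migrate() and the thread
setup are illustrative stand-ins, not kernel code (build with
"gcc -pthread sketch.c"):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	struct fake_rq {
		atomic_uint numa_migrate_on;	/* stands in for rq->numa_migrate_on */
	};

	static struct fake_rq dst_rq;		/* one contended destination rq */

	static void *try_migrate(void *arg)
	{
		/*
		 * Atomically claim the destination rq.  A non-zero old
		 * value means another migration already owns it, so bail
		 * out instead of racing, as task_numa_assign() does with
		 * xchg().
		 */
		if (atomic_exchange(&dst_rq.numa_migrate_on, 1)) {
			printf("thread %ld: dst rq busy, bailing out\n", (long)arg);
			return NULL;
		}

		printf("thread %ld: claimed dst rq, migrating\n", (long)arg);
		/* ... the actual move/swap would happen here ... */

		/* Release the flag once done, as task_numa_migrate() does. */
		atomic_store(&dst_rq.numa_migrate_on, 0);
		return NULL;
	}

	int main(void)
	{
		pthread_t t[2];

		for (long i = 0; i < 2; i++)
			pthread_create(&t[i], NULL, try_migrate, (void *)i);
		for (int i = 0; i < 2; i++)
			pthread_join(t[i], NULL);
		return 0;
	}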
specjbb2005 / bops/JVM / higher bops are better

on 2 Socket/2 Node Intel
JVMS  Prev     Current  %Change
4     199709   206350   3.32534
1     330830   319963   -3.28477

on 2 Socket/4 Node Power8 (PowerNV)
JVMS  Prev     Current  %Change
8     89011.9  89627.8  0.69193
1     218946   211338   -3.47483

on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev     Current  %Change
4     180473   186539   3.36117
1     212805   220344   3.54268

on 4 Socket/4 Node Power7
JVMS  Prev     Current  %Change
8     56941.8  56836    -0.185804
1     111686   112970   1.14965

dbench / transactions / higher numbers are better

on 2 Socket/2 Node Intel
count  Min      Max      Avg      Variance  %Change
5      12029.8  12124.6  12060.9  34.0076
5      13136.1  13170.2  13150.2  14.7482   9.03166

on 2 Socket/4 Node Power8 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      4968.51  5006.62  4981.31  13.4151
5      4319.79  4998.19  4836.53  261.109   -2.90646

on 2 Socket/2 Node Power9 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      9342.92  9381.44  9363.92  12.8587
5      9325.56  9402.7   9362.49  25.9638   -0.0152714

on 4 Socket/4 Node Power7
count  Min      Max      Avg      Variance  %Change
5      143.4    188.892  170.225  16.9929
5      132.581  191.072  170.554  21.6444   0.193274

Acked-by: Mel Gorman <mgor...@techsingularity.net>
Reviewed-by: Rik van Riel <r...@surriel.com>
Signed-off-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
---
Changelog v2->v3:
Add comments as requested by Peter.

 kernel/sched/fair.c  | 22 ++++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 23 insertions(+)
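As a rough sketch of the hand-off in task_numa_assign() below — the
search keeps at most one destination run-queue claimed at a time, and
finding a better CPU releases the previously claimed one — the
following uses C11 atomics in place of xchg()/WRITE_ONCE(); claim_dst(),
fake_env and rqs[] are hypothetical names for illustration only:

	#include <stdatomic.h>

	#define NR_CPUS	8

	struct fake_rq {
		atomic_uint numa_migrate_on;	/* mirrors rq->numa_migrate_on */
	};

	struct fake_env {
		int best_cpu;			/* -1 until a candidate is claimed */
	};

	static struct fake_rq rqs[NR_CPUS];

	/* Claim dst_cpu for this migration; release any prior best CPU. */
	static int claim_dst(struct fake_env *env, int dst_cpu)
	{
		/* Old value 1 means another migration owns this rq: bail. */
		if (atomic_exchange(&rqs[dst_cpu].numa_migrate_on, 1))
			return -1;

		/* A better CPU was found, so release the previous best. */
		if (env->best_cpu != -1)
			atomic_store(&rqs[env->best_cpu].numa_migrate_on, 0);

		env->best_cpu = dst_cpu;
		return 0;
	}

	int main(void)
	{
		struct fake_env env = { .best_cpu = -1 };

		claim_dst(&env, 2);	/* claim CPU 2 */
		claim_dst(&env, 5);	/* better: releases CPU 2, claims CPU 5 */
		return 0;
	}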
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..5cf921a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1514,6 +1514,21 @@ struct task_numa_env {
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
+	struct rq *rq = cpu_rq(env->dst_cpu);
+
+	/* Bail out if run-queue part of active numa balance. */
+	if (xchg(&rq->numa_migrate_on, 1))
+		return;
+
+	/*
+	 * Clear previous best_cpu/rq numa-migrate flag, since task now
+	 * found a better cpu to move/swap.
+	 */
+	if (env->best_cpu != -1) {
+		rq = cpu_rq(env->best_cpu);
+		WRITE_ONCE(rq->numa_migrate_on, 0);
+	}
+
 	if (env->best_task)
 		put_task_struct(env->best_task);
 	if (p)
@@ -1569,6 +1584,9 @@ static void task_numa_compare(struct task_numa_env *env,
 	long moveimp = imp;
 	int dist = env->dist;
 
+	if (READ_ONCE(dst_rq->numa_migrate_on))
+		return;
+
 	rcu_read_lock();
 	cur = task_rcu_dereference(&dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1710,6 +1728,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1,
 	};
 	struct sched_domain *sd;
+	struct rq *best_rq;
 	unsigned long taskweight, groupweight;
 	int nid, ret, dist;
 	long taskimp, groupimp;
@@ -1811,14 +1830,17 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	p->numa_scan_period = task_scan_start(p);
 
+	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
 		ret = migrate_task_to(p, env.best_cpu);
+		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
 	}
 
 	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+	WRITE_ONCE(best_rq->numa_migrate_on, 0);
 
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8ca..0b91612 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,6 +783,7 @@ struct rq {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int		nr_numa_running;
 	unsigned int		nr_preferred_running;
+	unsigned int		numa_migrate_on;
 #endif
 #define CPU_LOAD_IDX_MAX 5
 	unsigned long		cpu_load[CPU_LOAD_IDX_MAX];
-- 
1.8.3.1