On 05/06/2015 01:00 PM, Peter Zijlstra wrote: > On Wed, May 06, 2015 at 11:41:28AM -0400, Rik van Riel wrote: > >> Peter, Mel, I think it may be time to stop waiting for the impedance >> mismatch between the load balancer and NUMA balancing to be resolved, >> and try to just avoid the issue in the NUMA balancing code... > > That's a wee bit unfair since we 'all' decided to let the numa thing > rest for a while. So obviously that issue didn't get resolved.
I'm not blaming anyone, I know I was involved in the decision to let the NUMA code rest for a while, too. After a year of just sitting there, this is the only big bug affecting the NUMA balancing code that I have heard about. >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index ffeaa4105e48..480e6a35ab35 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -1409,6 +1409,30 @@ static void task_numa_find_cpu(struct task_numa_env >> *env, >> } >> } >> >> +/* Only move tasks to a NUMA node less busy than the current node. */ >> +static bool numa_has_capacity(struct task_numa_env *env) >> +{ >> + struct numa_stats *src = &env->src_stats; >> + struct numa_stats *dst = &env->dst_stats; >> + >> + if (src->has_free_capacity && !dst->has_free_capacity) >> + return false; >> + >> + /* >> + * Only consider a task move if the source has a higher destination >> + * than the destination, corrected for CPU capacity on each node. >> + * >> + * src->load dst->load >> + * --------------------- vs --------------------- >> + * src->compute_capacity dst->compute_capacity >> + */ >> + if (src->load * dst->compute_capacity > >> + dst->load * src->compute_capacity) >> + return true; >> + >> + return false; >> +} >> + >> static int task_numa_migrate(struct task_struct *p) >> { >> struct task_numa_env env = { >> @@ -1463,7 +1487,8 @@ static int task_numa_migrate(struct task_struct *p) >> update_numa_stats(&env.dst_stats, env.dst_nid); >> >> /* Try to find a spot on the preferred nid. */ >> - task_numa_find_cpu(&env, taskimp, groupimp); >> + if (numa_has_capacity(&env)) >> + task_numa_find_cpu(&env, taskimp, groupimp); >> >> /* >> * Look at other nodes in these cases: >> @@ -1494,7 +1519,8 @@ static int task_numa_migrate(struct task_struct *p) >> env.dist = dist; >> env.dst_nid = nid; >> update_numa_stats(&env.dst_stats, env.dst_nid); >> - task_numa_find_cpu(&env, taskimp, groupimp); >> + if (numa_has_capacity(&env)) >> + task_numa_find_cpu(&env, taskimp, groupimp); >> } >> } > > Does this not 'duplicate' the logic that we tried for with > task_numa_compare():balance section? That is where we try to avoid > making a decision that the regular load-balancer will dislike and undo. > > Alternatively; you can view that as a cpu guard and the proposed as a > node guard, in which case, should it not live inside > task_numa_find_cpu()? Instead of guarding all call sites. > > In any case, should we mix a bit of imbalance_pct in there? > > /me goes ponder this a bit further.. Yes, there is some duplication between this code and the logic in task_numa_compare() At this point I am not sure how to resolve that; I am interested in seeing whether this patch solves the issue reported by Artem and Jirka. If it does, we can think about cleanups. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/