We noticed that with the cgroup CPU controller in use, scheduling latency gets wonky regardless of nesting level or weight configuration. This is easily reproducible with Chris Mason's schbench[1].
All tests are run on a single-socket machine with 16 cores and 32
threads. While the machine is mostly idle, it isn't completely so;
there is always some variable management load going on. The command
used is

  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

which measures scheduling latency with 32 threads, each of which
repeatedly runs for 15ms and then sleeps for 10ms.

Here's a representative result when running from the root cgroup.

  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
  Latency percentiles (usec)
          50.0000th: 26
          75.0000th: 62
          90.0000th: 74
          95.0000th: 86
          *99.0000th: 887
          99.5000th: 3692
          99.9000th: 10832
          min=0, max=13374

The following is inside a first level CPU cgroup with the maximum
weight.

  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
  Latency percentiles (usec)
          50.0000th: 31
          75.0000th: 65
          90.0000th: 71
          95.0000th: 91
          *99.0000th: 7288
          99.5000th: 10352
          99.9000th: 12496
          min=0, max=13023

Note the drastic increase in p99 scheduling latency. Investigation
showed that update_sd_lb_stats(), which load_balance() uses to pick
the most loaded group, was often picking the wrong group. A CPU which
has one schbench task running and another queued wouldn't report the
correspondingly higher weighted_cpuload() and would get overlooked as
the target of load balancing.

weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is
the sum of the load_avg of all queued sched_entities. Without cgroups,
or at the root cgroup, each task's load_avg contributes directly to
the sum. When a task wakes up or goes to sleep, the change is
immediately reflected in runnable_load_avg, which in turn affects
load balancing.

However, when the CPU cgroup is in use, a nested cfs_rq blocks this
immediate reflection. When a task wakes up inside a cgroup, the
nested cfs_rq's runnable_load_avg is updated but doesn't get
propagated to the parent. It only affects the matching sched_entity's
load_avg over time, which then gets propagated to the
runnable_load_avg at that level.
This makes weighted_cpuload() often temporarily out of sync, leading
to suboptimal behavior of load_balance() and the increase in
scheduling latencies shown above.

This patch fixes the issue by updating propagate_entity_load_avg() to
always propagate to the parent's runnable_load_avg. Combined with the
previous patch, this keeps a cfs_rq's runnable_load_avg always the
sum of the scaled loads of all tasks queued below it, removing the
artifacts from nested cfs_rqs. The following is from inside three
levels of nesting with the patch applied.

  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
  Latency percentiles (usec)
          50.0000th: 40
          75.0000th: 71
          90.0000th: 89
          95.0000th: 108
          *99.0000th: 679
          99.5000th: 3500
          99.9000th: 10960
          min=0, max=13790

[1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

Signed-off-by: Tejun Heo <t...@kernel.org>
Cc: Vincent Guittot <vincent.guit...@linaro.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Mike Galbraith <efa...@gmx.de>
Cc: Paul Turner <p...@google.com>
Cc: Chris Mason <c...@fb.com>
---
 kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
 
 /* Take into account change of load of a child task group */
 static inline void
-update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
+		   bool propagate_avg)
 {
 	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
 	long load = 0, delta;
@@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq
 	se->avg.load_avg = load;
 	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
 
-	/* Update parent cfs_rq load */
-	add_positive(&cfs_rq->avg.load_avg, delta);
-	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+	if (propagate_avg) {
+		/* Update parent cfs_rq load */
+		add_positive(&cfs_rq->avg.load_avg, delta);
+		cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+	}
 
 	/*
 	 * If the sched_entity is already enqueued, we also have to update the
@@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
 /* Update task and its cfs_rq load average */
 static inline int propagate_entity_load_avg(struct sched_entity *se)
 {
-	struct cfs_rq *cfs_rq;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	bool propagate_avg;
 
 	if (entity_is_task(se))
 		return 0;
 
-	if (!test_and_clear_tg_cfs_propagate(se))
-		return 0;
-
-	cfs_rq = cfs_rq_of(se);
+	propagate_avg = test_and_clear_tg_cfs_propagate(se);
 
-	set_tg_cfs_propagate(cfs_rq);
+	/*
+	 * We want to keep @cfs_rq->runnable_load_avg always in sync so
+	 * that the load balancer can accurately determine the busiest CPU
+	 * regardless of cfs_rq nesting.
+	 */
+	update_tg_cfs_load(cfs_rq, se, propagate_avg);
 
-	update_tg_cfs_util(cfs_rq, se);
-	update_tg_cfs_load(cfs_rq, se);
+	if (propagate_avg) {
+		set_tg_cfs_propagate(cfs_rq);
+		update_tg_cfs_util(cfs_rq, se);
+	}
 
-	return 1;
+	return propagate_avg;
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */