On Fri, Feb 17, 2017 at 12:07:30PM +0000, Matt Fleming wrote:
> If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
> the pending sample window time on exit, setting the next update not
> one window into the future, but two.
>
> This situation on exiting NO_HZ is described by:
>
> this_rq->calc_load_update < jiffies < calc_load_update
>
> In this scenario, what we should be doing is:
>
> this_rq->calc_load_update = calc_load_update [ next window ]
>
> But what we actually do is:
>
> this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]
>
> This has the effect of delaying load average updates for potentially
> up to ~9 seconds.
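
To make the window arithmetic above concrete, here is a minimal userspace sketch, not the kernel code. LOAD_FREQ is defined as in the kernel (5*HZ+1); HZ=1000, the helper names (exit_nohz_actual, exit_nohz_expected, ticks_until, demo) and the sample tick values are assumptions chosen for illustration only.

```c
#include <assert.h>

#define HZ        1000UL         /* assumed tick rate, illustration only */
#define LOAD_FREQ (5 * HZ + 1)   /* one sample window, in ticks */

/* Hypothetical model of "what we actually do" on NO_HZ exit. */
static unsigned long exit_nohz_actual(unsigned long global_update)
{
	return global_update + LOAD_FREQ;   /* next+1 window */
}

/* Hypothetical model of "what we should be doing" on NO_HZ exit. */
static unsigned long exit_nohz_expected(unsigned long global_update)
{
	return global_update;               /* next window */
}

/* Ticks remaining from 'now' until this CPU's next load update. */
static unsigned long ticks_until(unsigned long now, unsigned long update)
{
	return update - now;
}

/* Walk through one made-up instance of the scenario described above. */
static void demo(void)
{
	unsigned long rq_update = 10000;                      /* stale per-CPU value */
	unsigned long global_update = rq_update + LOAD_FREQ;  /* 15001 */
	unsigned long now = 12000;                            /* wake-up time */

	/* this_rq->calc_load_update < jiffies < calc_load_update */
	assert(rq_update < now && now < global_update);

	/* Expected: update at the next window, 3001 ticks away. */
	assert(ticks_until(now, exit_nohz_expected(global_update)) == 3001);

	/* Actual: one extra window is skipped, 8002 ticks away. */
	assert(ticks_until(now, exit_nohz_actual(global_update)) == 8002);
}
```

In this model the two results always differ by exactly LOAD_FREQ, so the CPU's contribution to the global count is sampled one full window late.
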
>
> This can result in huge spikes in the load average values due to
> per-cpu uninterruptible task counts being out of sync when accumulated
> across all CPUs.
>
> It's safe to update the per-cpu active count if we wake between sample
> windows because any load that we left in 'calc_load_idle' will have
> been zero'd when the idle load was folded in calc_global_load().
>
> This issue is easy to reproduce before,
>
> commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization
> average tracking")
>
> just by forking short-lived process pipelines built from ps(1) and
> grep(1) in a loop. I'm unable to reproduce the spikes after that
> commit, but the bug still seems to be present from code review.
>
> Fixes: 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
> Cc: Peter Zijlstra <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: Vincent Guittot <[email protected]>
Acked-by: Frederic Weisbecker <[email protected]>
Thanks, it's much clearer now!