On 11 April 2018 at 18:00, Peter Zijlstra <pet...@infradead.org> wrote:
> On Wed, Apr 11, 2018 at 05:41:24PM +0200, Vincent Guittot wrote:
>> Yes. And to be honest I don't have any clue about the root cause :-(
>> Heiner mentioned that it's much better in latest linux-next, but I
>> haven't seen any changes related to the code of those patches.
>
> Yeah, it's a bit of a puzzle. Now you touch nohz, and the patches in
> next that are most likely to have affected this are rjw's
> cpuidle-vs-nohz patches. The common denominator being nohz.
>
> Now I think rjw's patches will ensure we enter nohz _less_; they avoid
> stopping the tick when we expect to go idle for a short period only.
>
> So if your patch makes nohz go wobbly, going nohz less will make that
> better.
>
> Of course, I've no actual clue as to what that patch (it's the last one
> in the series, right?:
>
>   31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>
> ) does that is so offensive to that one machine. You never did manage to
> reproduce, right?
Yes.

> Could it be that for some reason the nohz balancer now takes a very long
> time to run?

Heiner mentioned that it is a relatively slow Celeron and that he uses
the ondemand governor. So I was about to ask him to try the performance
governor, to see whether the problem is the CPU running slowly and
taking too much time to enter idle.

> Could something like the following happen (and this is really flaky
> thinking here):
>
> last CPU goes idle, we enter idle_balance(), that kicks ilb, ilb runs,
> which somehow again triggers idle_balance() and around we go?
>
> I'm not immediately seeing how that could happen, but if we do something
> daft like that we can tie up the CPU for a while, mostly with IRQs
> disabled, and that would be visible as that latency he sees.