On Mon, Mar 30, 2015 at 08:26:19AM +0100, Preeti U Murthy wrote:
> Hi Morten,
>
> On 03/27/2015 11:26 PM, Morten Rasmussen wrote:
> >
> > I agree that the current behaviour is undesirable and should be fixed,
> > but IMHO waking up all idle cpus can not be justified. It is only one
> > additional cpu though with your patch so it isn't quite that bad.
> >
> > I agree that it is hard to predict how many additional cpus you need,
> > but I don't think you necessarily need that information as long as you
> > start by filling up the cpu that was kicked to do the
> > nohz_idle_balance() first.
> >
> > You would also solve your problem if you removed the ability for the cpu
> > to bail out after balancing itself and force it to finish the job. It
> > would mean harming tasks that where pulled to the balancing cpu as they
> > would have to wait being scheduling until the nohz_idle_balance() has
> > completed. It could be a price worth paying.
>
> But how would this prevent waking up idle CPUs ? You still end up waking
> up all idle CPUs, wouldn't you?

That depends on the scenario. In your example from the changelog you
would. You have enough work for all the nohz-idle cpus, so you will keep
iterating through all of them, pulling work on their behalf and hence
waking them up. But in a scenario where you don't have enough work for
all nohz-idle cpus, you are guaranteed that the balancer cpu has taken
its share and doesn't go back to sleep immediately after finishing
nohz_idle_balance(). So all cpus woken up will have a task to run,
including the balancer cpu.

> >
> > An alternative could be to let the balancing cpu balance itself first
> > and bail out as it currently does, but let it kick the next nohz_idle
> > cpu to continue the job if it thinks there is more work to be done. So
> > you would get a chain of kicks that would stop when there nothing
> > more to do be done. It isn't quite as fast as your solution as it would
>
> I am afraid there is more to this. If a given CPU is unable to pull
> tasks, it could mean that it is an unworthy destination CPU. But it does
> not mean that the other idle CPUs are unworthy of balancing too.
>
> So if the ILB CPU stops waking up idle CPUs when it has nothing to pull,
> we will end up hurting load balancing. Take for example the scenario
> described in the changelog. The idle CPUs within a numa node may find
> load balanced within themselves and hence refrain from pulling any load.
> If these ILB CPUs stop nohz idle load balancing at this point, the load
> will never get spread across nodes.
>
> If on the other hand, if we keep kicking idle CPUs to carry on idle load
> balancing, the wakeup scenario will be no better than it is with this patch.

By "more work to be done" I didn't mean stopping when the balancer cpu
gives up. I meant stopping the kick chain when there is nothing more to
be balanced/pulled (or reasonably close). For example, use something like
the nohz_kick_needed() checks on the source cpu/group and stop if all
cpus only have one runnable task.
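
To make the stop condition concrete, here is a rough user-space sketch of
the kick chain. It is not kernel code and does not call
nohz_kick_needed(); struct cpu_snapshot, more_work_to_pull() and
run_kick_chain() are hypothetical names invented for illustration, and
"pull one task from the busiest cpu" is a simplifying assumption about
what each kicked cpu does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cpu snapshot for illustration; not a kernel structure. */
struct cpu_snapshot {
	unsigned int nr_running;	/* runnable tasks on this cpu */
	bool nohz_idle;			/* cpu is currently nohz-idle */
};

/*
 * True if some cpu has more than one runnable task, i.e. a freshly
 * kicked idle cpu would still have something to pull.
 */
static bool more_work_to_pull(const struct cpu_snapshot *cpus, size_t nr_cpus)
{
	assert(cpus != NULL);

	for (size_t i = 0; i < nr_cpus; i++)
		if (cpus[i].nr_running > 1)
			return true;
	return false;
}

/*
 * Simulate the chain: each nohz-idle cpu is kicked only while there is
 * still work to pull, takes one task from the busiest cpu, and the
 * chain ends as soon as every cpu has at most one runnable task.
 * Returns the number of cpus that were woken.
 */
static size_t run_kick_chain(struct cpu_snapshot *cpus, size_t nr_cpus)
{
	size_t woken = 0;

	for (size_t i = 0; i < nr_cpus; i++) {
		if (!cpus[i].nohz_idle)
			continue;
		if (!more_work_to_pull(cpus, nr_cpus))
			break;	/* nothing left to balance: stop kicking */

		/* Pick the busiest cpu as the source (assumed policy). */
		size_t busiest = 0;
		for (size_t j = 1; j < nr_cpus; j++)
			if (cpus[j].nr_running > cpus[busiest].nr_running)
				busiest = j;

		cpus[busiest].nr_running--;
		cpus[i].nr_running++;
		cpus[i].nohz_idle = false;
		woken++;
	}
	return woken;
}
```

With one cpu running three tasks and three nohz-idle cpus, this wakes
two of the idle cpus and leaves the third asleep, so every woken cpu has
a task to run.
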
At least try to stop waking an extra cpu when there is clearly no point
in doing so.

> > require an IPI plus wakeup for each cpu to continue the work. But it
> > should be much faster than the current code I think.
> >
> > IMHO it makes more sense to stay with the current scheme of ensuring
> > that the kicked cpu is actually used before waking up more cpus and
> > instead improve how additional cpus are kicked if they are needed.
>
> It looks more sensible to do this in parallel. The scenario on POWER is
> that tasks don't spread out across nodes until 10s of fork. This is
> unforgivable and we cannot afford the code to be the way it is today.

You propose having multiple balancing cpus running in parallel?

I fully agree that the nohz-balancing behaviour should be improved. It
could use a few changes to improve energy-awareness as well. IMHO,
taking a double hit every time we need to wake up an additional cpu is
inefficient and goes against all the effort put into reducing wake-ups,
which is essential for saving energy.

One thing that would help reduce energy consumption, and that vendors
carry out-of-tree, is improving find_new_ilb() to pick the cheapest cpu
to be kicked for nohz-idle balancing. However, this improvement is
pointless if we are going to wake up an additional cpu to receive the
task(s) that need to be balanced.

I think a solution is possible where we at least try to avoid waking up
additional cpus and still vastly improve your 10s latency.

Morten