Hi Rik,

On 31/07/2020 09:42, Rik van Riel wrote:
> Hello,
>
> last year at Linux Plumbers conference, I presented on my work
> of turning the hierarchical CFS runqueue into a flat runqueue,
> and Paul Turner pointed out some corner cases that could not
> work with my design as it was last year.
>
> Paul pointed out two corner cases, and I have come up with a
> third one myself, but I believe they all revolve around the
> same issue, admission control, and can all be solved with the
> same solution.
[...]

> Possible solution
>
> A possible solution to the problems above consists of
> three parts:
> - Instead of accounting all used wall clock CPU time into
>   vruntime immediately, at the task's current hierarchical
>   priority, the vruntime can be accounted for piecemeal, for
>   example in amounts corresponding to "one timeslice at full
>   priority".
> - If there is only one runnable task on the runqueue,
>   all the runtime can be accounted into vruntime in one go.
> - Tasks that cannot account all of their used CPU time
>   into vruntime at once can be removed from the root runqueue,
>   and placed into the cgroup runqueue. A heap of cgroup
>   runqueues with "overloaded" tasks can be attached to the
>   main runqueue, where the left-most task from that heap of
>   heaps gets some vruntime accounted every time we go into
>   pick_next_task.
> - The difference between the vruntime of each task in that
>   heap and the vruntime of the root runqueue can help determine
>   how much vruntime can be accounted to that task at once.
> - If the task, or its runqueue, is no longer the left-most
>   in the heap after getting vruntime accounted, that runqueue
>   and the queue of runqueues can be resorted.
> - Once a task has accounted all of its outstanding delta
>   exec runtime into vruntime, it can be moved back to the
>   main runqueue.
> - This should solve the unequal task weight scenario Paul
>   Turner pointed out last year, because after task t1 goes
>   to sleep and only t2 and t3 remain on the CPU, t2 will
>   get its delta exec runtime converted into vruntime at
>   its new priority (equal to t3).
> - By only accounting delta exec runtime to vruntime for
>   the left-most task in the "overloaded" heap at one time,
>   we guarantee only one task at a time will be added back
>   into the root runqueue.
> - Every time a task is added to the root runqueue, that
>   slows down the rate at which vruntime advances, which
>   in turn reduces the rate at which tasks get added back
>   into the runqueue, and makes it more likely that a currently
>   running task with low hierarchical priority gets booted
>   off into the "overloaded" heap.
>
> To tackle the thundering herd at task wakeup time, another
> strategy may be needed. One thing we may be able to do
> there is place tasks into the "overloaded" heap immediately
> on wakeup, if the hierarchical priority of the task is so
> low that if the task were to run a minimal timeslice length,
> it would be unable to account that time into its vruntime
> in one go, AND the CPU already has a larger number of tasks
> on it.
>
> Because the common case on most systems is having just
> 0, 1, or 2 runnable tasks on a CPU, this fancy scheme
> should rarely end up being used, and even when it is the
> overhead should be reasonable because most of the
> overloaded tasks will just sit there until pick_next_task
> gets around to them.
>
> Does this seem like something worth trying out?
>
> Did I overlook any other corner cases that would make this
> approach unworkable?
>
> Did I forget to explain anything that is needed to help
> understand the problem and the proposed solution better?

I imagine that I can see what you want to achieve here ;-)

But it's hard since your v5 RFC
https://lkml.kernel.org/r/20190906191237.27006-1-r...@surriel.com
is pretty old by now. Do you have a version of the patch-set
against tip/sched/core? Quite a lot has changed (runnable load
avg replaced by runnable avg, rq->load is gone, CFS load balance
rework).

IIRC, the 'CFS flat runqueue design' has the advantage of reducing
the overhead in taskgroup-heavy environments like systemd. And I
recall that v5 doesn't cover CFS bandwidth control yet.
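To make sure I follow the piecemeal accounting part, here is roughly
how I picture it (just a userspace sketch of my reading, not code from
your patches; the 4ms slice value, the struct layout and the names are
all mine):

```c
#include <stdbool.h>
#include <stdint.h>

#define NICE_0_WEIGHT  1024ULL     /* weight of a nice-0 task at root level */
#define TIMESLICE_NS   4000000ULL  /* "one timeslice at full priority", 4ms */

struct flat_task {
    uint64_t vruntime;    /* weighted time, in ns                       */
    uint64_t delta_exec;  /* wall-clock ns not yet folded into vruntime */
    uint64_t h_weight;    /* hierarchical weight, <= NICE_0_WEIGHT      */
    bool overloaded;      /* true while parked in the "overloaded" heap */
};

/*
 * Fold at most "one timeslice at full priority" worth of vruntime into
 * the task, called once per pick_next_task for the left-most task in
 * the overloaded heap.  A task with a low hierarchical weight converts
 * only a small slice of wall-clock time per call, so it stays in the
 * overloaded heap across several calls; once delta_exec reaches zero
 * it may rejoin the root runqueue.
 */
static void account_vruntime_piecemeal(struct flat_task *t)
{
    /* wall-clock chunk whose weighted cost is one full-priority slice */
    uint64_t chunk = TIMESLICE_NS * t->h_weight / NICE_0_WEIGHT;

    if (chunk >= t->delta_exec)
        chunk = t->delta_exec;   /* remainder fits: account it in one go */

    t->delta_exec -= chunk;
    t->vruntime   += chunk * NICE_0_WEIGHT / t->h_weight;

    if (t->delta_exec == 0)
        t->overloaded = false;   /* eligible for the root runqueue again */
}
```

So a half-weight task with 10ms of outstanding delta exec runtime would
need five trips through pick_next_task, each folding 2ms of wall time
in as 4ms of vruntime. Is that the intended behaviour?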
IMHO it would be extremely helpful to have a current patch-set to discuss how these other problems can be covered by patches on top.
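And for the wakeup-time placement idea, is this roughly the check you
have in mind? (Again only a sketch to check my understanding; the
minimal-slice value and the nr_running cutoff are placeholders I
invented, since the mail leaves them open.)

```c
#include <stdbool.h>
#include <stdint.h>

#define NICE_0_WEIGHT   1024ULL    /* weight of a nice-0 task at root level */
#define TIMESLICE_NS    4000000ULL /* "one timeslice at full priority", 4ms */
#define MIN_SLICE_NS     750000ULL /* minimal slice a woken task may run,
                                      placeholder value                     */
#define NR_RUNNING_BUSY 8          /* "larger number of tasks" cutoff,
                                      placeholder value                     */

/*
 * Decide at wakeup whether a task should bypass the root runqueue and
 * go straight into the "overloaded" heap: its hierarchical weight is
 * so low that even a minimal timeslice could not be folded into
 * vruntime in one full-priority chunk, AND the CPU already has many
 * runnable tasks.
 */
static bool wakeup_straight_to_overloaded(uint64_t h_weight,
                                          unsigned int nr_running)
{
    /* vruntime cost of running one minimal slice at this weight */
    uint64_t cost = MIN_SLICE_NS * NICE_0_WEIGHT / h_weight;

    return cost > TIMESLICE_NS && nr_running >= NR_RUNNING_BUSY;
}
```

With those placeholder numbers, a task at 1/16th of nice-0 weight
(cost 12ms > 4ms) would be diverted on a busy CPU, while any nice-0
task would never be, which matches the "rarely used" expectation for
the common 0/1/2-task case.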