On 10/29/25 19:08, Andrea Righi wrote:
> sched_ext tasks can be starved by long-running RT tasks, especially since
> RT throttling was replaced by deadline servers to boost only SCHED_NORMAL
> tasks.
>
> Several users in the community have reported issues with RT stalling
> sched_ext tasks. This is fairly common on distributions or environments
> where applications like video compositors, audio services, etc. run as RT
> tasks by default.
>
> Example trace (showing a per-CPU kthread stalled due to the sway Wayland
> compositor running as an RT task):
>
>  runnable task stall (kworker/0:0[106377] failed to run for 5.043s)
>  ...
>  CPU 0   : nr_run=3 flags=0xd cpu_rel=0 ops_qseq=20646200 pnt_seq=45388738
>            curr=sway[994] class=rt_sched_class
>    R kworker/0:0[106377] -5043ms
>        scx_state/flags=3/0x1 dsq_flags=0x0 ops_state/qseq=0/0
>        sticky/holding_cpu=-1/-1 dsq_id=0x8000000000000002 dsq_vtime=0
>        slice=20000000
>        cpus=01
>
> This is often perceived as a bug in the BPF schedulers, but in reality
> schedulers can't do much: RT tasks run outside their control and can
> potentially consume 100% of the CPU bandwidth.
>
> Fix this by adding a sched_ext deadline server, so that sched_ext tasks are
> also boosted and do not suffer starvation.
>
> Two kselftests are also provided to verify the starvation fixes and
> bandwidth allocation is correct.
>
> == Highlights in this version ==
>
>  - wait for inactive_task_timer() to fire before removing the bandwidth
>    reservation (Juri/Peter: please check if this new
>    dl_server_remove_params() implementation makes sense to you)
>  - removed the explicit dl_server_stop() from dequeue_task_scx() and rely
>    on the delayed stop behavior (Juri/Peter: ditto)
>
> This patchset is also available in the following git branch:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-dl-server
>
> Changes in v10:
>  - reordered patches to better isolate sched_ext changes vs sched/deadline
>    changes (Andrea Righi)
>  - define ext_server only with CONFIG_SCHED_CLASS_EXT=y (Andrea Righi)
>  - add WARN_ON_ONCE(!cpus) check in dl_server_apply_params() (Andrea Righi)
>  - wait for inactive_task_timer to fire before removing the bandwidth
>    reservation (Juri Lelli)
>  - remove explicit dl_server_stop() in dequeue_task_scx() to reduce timer
>    reprogramming overhead (Juri Lelli)
>  - do not restart pick_task() when invoked by the dl_server (Tejun Heo)
>  - rename rq_dl_server to dl_server (Peter Zijlstra)
>  - fixed a missing dl_server start in dl_server_on() (Christian Loehle)
>  - add a comment to the rt_stall selftest to better explain the 4%
>    threshold (Emil Tsalapatis)
>
> Changes in v9:
>  - Drop the ->balance() logic as its functionality is now integrated into
>    ->pick_task(), allowing dl_server to call pick_task_scx() directly
>  - Link to v8:
>    https://lore.kernel.org/all/[email protected]/
>
> Changes in v8:
>  - Add tj's patch to de-couple balance and pick_task and avoid changing
>    sched/core callbacks to propagate @rf
>  - Simplify dl_se->dl_server check (suggested by PeterZ)
>  - Small coding style fixes in the kselftests
>  - Link to v7:
>    https://lore.kernel.org/all/[email protected]/
>
> Changes in v7:
>  - Rebased to Linus master
>  - Link to v6:
>    https://lore.kernel.org/all/[email protected]/
>
> Changes in v6:
>  - Added Acks to few patches
>  - Fixes to few nits suggested by Tejun
>  - Link to v5:
>    https://lore.kernel.org/all/[email protected]/
>
> Changes in v5:
>  - Added a kselftest (total_bw) to sched_ext to verify bandwidth values
>    from debugfs
>  - Address comment from Andrea about redundant rq clock invalidation
>  - Link to v4:
>    https://lore.kernel.org/all/[email protected]/
>
> Changes in v4:
>  - Fixed issues with hotplugged CPUs having their DL server bandwidth
>    altered due to loading SCX
>  - Fixed other issues
>  - Rebased on Linus master
>  - All sched_ext kselftests reliably pass now, also verified that the
>    total_bw in debugfs (CONFIG_SCHED_DEBUG) is conserved with these patches
>  - Link to v3:
>    https://lore.kernel.org/all/[email protected]/
>
> Changes in v3:
>  - Removed code duplication in debugfs. Made ext interface separate
>  - Fixed issue where rq_lock_irqsave was not used in the relinquish patch
>  - Fixed running bw accounting issue in dl_server_remove_params
>  - Link to v2:
>    https://lore.kernel.org/all/[email protected]/
>
> Changes in v2:
>  - Fixed a hang related to using rq_lock instead of rq_lock_irqsave
>  - Added support to remove BW of DL servers when they are switched to/from EXT
>  - Link to v1:
>    https://lore.kernel.org/all/[email protected]/
>
> Andrea Righi (5):
>   sched/deadline: Add support to initialize and remove dl_server bandwidth
>   sched_ext: Add a DL server for sched_ext tasks
>   sched/deadline: Account ext server bandwidth
>   sched_ext: Selectively enable ext and fair DL servers
>   selftests/sched_ext: Add test for sched_ext dl_server
>
> Joel Fernandes (6):
>   sched/debug: Fix updating of ppos on server write ops
>   sched/debug: Stop and start server based on if it was active
>   sched/deadline: Clear the defer params
>   sched/deadline: Add a server arg to dl_server_update_idle_time()
>   sched/debug: Add support to change sched_ext server params
>   selftests/sched_ext: Add test for DL server total_bw consistency
>
>  kernel/sched/core.c                              |   3 +
>  kernel/sched/deadline.c                          | 169 +++++++++++---
>  kernel/sched/debug.c                             | 171 +++++++++++---
>  kernel/sched/ext.c                               | 144 +++++++++++-
>  kernel/sched/fair.c                              |   2 +-
>  kernel/sched/idle.c                              |   2 +-
>  kernel/sched/sched.h                             |   8 +-
>  kernel/sched/topology.c                          |   5 +
>  tools/testing/selftests/sched_ext/Makefile       |   2 +
>  tools/testing/selftests/sched_ext/rt_stall.bpf.c |  23 ++
>  tools/testing/selftests/sched_ext/rt_stall.c     | 222 ++++++++++++++++++
>  tools/testing/selftests/sched_ext/total_bw.c     | 281 +++++++++++++++++++++++
>  12 files changed, 955 insertions(+), 77 deletions(-)
>  create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
>  create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
>  create mode 100644 tools/testing/selftests/sched_ext/total_bw.c

Thanks Andrea, I've tested a few things I had in mind with no complaints.
Most importantly, it a) doesn't break the existing fair_server and
b) ensures BPF schedulers don't stall even with something like:

sudo chrt -r 95 stress-ng --cpu 0 --taskset 0-$(($(nproc)-1)) -t 30m

For patches 0 to 9:
Tested-by: Christian Loehle <[email protected]>
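
In case anyone wants to repeat the RT stress run with a different CPU count, priority or duration, here is a rough sketch of how the command above could be parameterized. The helper names cpu_range and rt_stress_cmd are made up for illustration; the functions only print the command line and run nothing by themselves.

```shell
#!/bin/sh
# Sketch only: build the RT stress command from the test above without
# running it. cpu_range and rt_stress_cmd are hypothetical helper names.

# Print the CPU list "0-(N-1)" for N CPUs (N >= 1).
cpu_range() {
    echo "0-$(( $1 - 1 ))"
}

# Print the full command line for N CPUs. Priority and duration default
# to the values used in the test above (SCHED_RR prio 95, 30 minutes).
rt_stress_cmd() {
    ncpus="$1"
    prio="${2:-95}"
    duration="${3:-30m}"
    echo "chrt -r $prio stress-ng --cpu 0 --taskset $(cpu_range "$ncpus") -t $duration"
}
```

With a BPF scheduler loaded, something like `sudo sh -c "$(rt_stress_cmd "$(nproc)")"` should then be equivalent to the one-liner above.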

