During discussion of the scheduler deadline bug [1], Pierre Gondois
pointed out a potential issue during kexec: as CPUs are unplugged, the
available DL bandwidth of the root domain gradually decreases. At some
point, insufficient bandwidth triggers an overflow detection, causing
CPU hot-removal to fail and kexec to hang.[2]
I reproduced it on a system with 160 cpus with the following command
seq 10 | xargs -I{} -P10 sh -c 'chrt -d -T 1000000 -P 1000000 0 yes >
/dev/null &'
kexec -e
The system hang during the kexec process.
This series skips the DL bandwidth check, SIGSTOP all DL tasks so that
the kexec process can proceed.
[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/all/[email protected]/
RFC -> v2:
Instead of migrating the DL tasks, SIGSTOP them.
Pingfan Liu (2):
sched/deadline: Skip the deadline bandwidth check if kexec_in_progress
kernel/kexec: Stop all userspace deadline tasks
kernel/kexec_core.c | 23 +++++++++++++++++++++++
kernel/sched/deadline.c | 7 +++++++
2 files changed, 30 insertions(+)
--
2.49.0