This bug is missing log files that will aid in diagnosing the problem.
While running an Ubuntu kernel (not a mainline or third-party kernel)
please enter the following command in a terminal window:
apport-collect 1836971
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable
to run this command, please add a comment stating that fact and change
the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the
Ubuntu Kernel Team.
** Changed in: linux (Ubuntu)
Status: New => Incomplete
** Tags added: bionic
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1836971
Title:
sched: Prevent CPU lockups when task groups take longer than the
period
Status in linux package in Ubuntu:
Incomplete
Status in linux source package in Bionic:
In Progress
Bug description:
BugLink: https://bugs.launchpad.net/bugs/1836971
[Impact]
On machines with extremely high CPU usage, parent task groups which
have a large number of children can make the for loop in
sched_cfs_period_timer() run until the watchdog fires when the
cfs_period_us setting is too short.
In this particular case, it is unlikely that the call to
hrtimer_forward_now() will return 0, meaning the for loop is never
exited, and tasks are never rescheduled.
The large number of children makes do_sched_cfs_period_timer() take
longer than the period, so each call to hrtimer_forward_now() keeps
reporting at least one overrun and the loop spins until the watchdog
fires.
The kernel will produce this call trace:
CPU: 51 PID: 0 Comm: swapper/51 Tainted: P OELK 4.15.0-50-generic
#54-Ubuntu
Call Trace:
<IRQ>
? sched_clock+0x9/0x10
walk_tg_tree_from+0x61/0xd0
? task_rq_unlock+0x30/0x30
unthrottle_cfs_rq+0xcb/0x1a0
distribute_cfs_runtime+0xd7/0x100
sched_cfs_period_timer+0xd9/0x1a0
? sched_cfs_slack_timer+0xe0/0xe0
__hrtimer_run_queues+0xdf/0x230
hrtimer_interrupt+0xa0/0x1d0
smp_apic_timer_interrupt+0x6f/0x130
apic_timer_interrupt+0x84/0x90
</IRQ>
This has been hit in production in a particularly highly utilised
Hadoop cluster which powers an analytics platform. About 30% of the
cluster experiences this issue every week, and the affected machines
need a manual reboot to come back online.
[Fix]
This was fixed in 5.1 upstream with the below commit:
commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8
Author: Phil Auld <[email protected]>
Date: Tue Mar 19 09:00:05 2019 -0400
Subject: sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
This commit adds a check to see if the loop has run too many times, and if it
has, scales up the period and quota so that the timer can complete before the
next period expires, enabling tasks to be rescheduled normally.
Note, 2e8e19226398db8265a8e675fcc0118b9e80c9e8 was included in
upstream stable versions 4.4.179, 4.9.171, 4.14.114, 4.19.37, 5.0.10.
Please cherry pick 2e8e19226398db8265a8e675fcc0118b9e80c9e8 to all bionic
kernels.
[Testcase]
This is difficult to reproduce artificially, so it was tested on a production
Hadoop cluster with extremely high CPU load.
I built a test kernel, which is available here:
https://launchpad.net/~mruffell/+archive/ubuntu/sf232784-test
For unpatched kernels, expect the machine to lockup and print the call
trace in the impact section.
For patched kernels, if the machine hits the condition, it will print
a warning to the kernel log with the new period and quota which has
been used:
Example from the same hadoop cluster with a machine running the test
kernel:
% uname -a
4.15.0-50-generic #54+hf232784v20190626b1-Ubuntu
% sudo grep cfs /var/log/kern.log.*
cfs_period_timer[cpu40]: period too short, scaling up (new cfs_period_us
67872, cfs_quota_us = 3475091)
cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us
22430, cfs_quota_us = 1148437)
cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us
25759, cfs_quota_us = 1318908)
cfs_period_timer[cpu68]: period too short, scaling up (new cfs_period_us
29583, cfs_quota_us = 1514684)
cfs_period_timer[cpu49]: period too short, scaling up (new cfs_period_us
33974, cfs_quota_us = 1739519)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us
39017, cfs_quota_us = 1997729)
cfs_period_timer[cpu10]: period too short, scaling up (new cfs_period_us
44809, cfs_quota_us = 2294267)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us
51460, cfs_quota_us = 2634823)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us
59099, cfs_quota_us = 3025929)
cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us
67872, cfs_quota_us = 3475091)
[Regression Potential]
This patch was accepted into upstream stable versions 4.4.179, 4.9.171,
4.14.114, 4.19.37, 5.0.10, and is thus treated as stable and trusted by the
community.
Xenial received this patch in 4.4.0-150.176, as per LP #1828420
Disco will receive this patch in the next version, as per LP #1830922
Eoan already has the patch, being based on 5.2.
While this does affect a core part of the kernel, the scheduler, the
patch has been extensively tested and proven in production
environments, so the overall risk is low.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1836971/+subscriptions
--
Mailing list: https://launchpad.net/~kernel-packages
Post to : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp