On 01/25/2018 05:40 PM, Hans van Kranenburg wrote:
> Hi Ryan,
>
> On 01/26/2018 12:20 AM, Ryan Thoryk wrote:
>> Package: linux-image-4.9.0-5-amd64
>> Version: 4.9.65-3+deb9u2
>> Severity: normal
>>
>> I'm having an issue with CPU usage reporting, tested on kernels 4.9.0-3
>> and 4.9.0-5.  The machines are running on Amazon EC2, which could be
>> related.  With the "sar" utility, after some time, the system's "steal"
>> value periodically is 100%,
>
> This means that your vcpus want to execute work but are not being
> scheduled on a physical cpu core. Either the physical machine gets too
> much work from all the virtual machines that are requesting cpu time, or
> other things are going on, like your virtual machine getting paused
> (e.g. when doing live migration there's a handover moment when it's
> shortly paused and then resumed, this is also visible as a short 100%
> steal spike).

After going over log files, it appears that the issue started when Amazon did a live migration of the VM, probably for the Meltdown patching.

>> and the normal CPU user/system values,
>> including idle, are always 0.  When running a cpu-intensive app and
>> using the "top" utility, the user and system values are always 0, the
>> "idle" field stays at 100%, and only the "wait" field increases.
>
> Sounds a lot like this one:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=871608
>
> A patch to fix that cpu accounting breakage (picked from linux 4.15) was
> included in 4.9.65-3. So only for the 4.9.0-3 (which actual version?)
> you could be seeing that one happening.

The two kernel versions were 4.9.30-2+deb9u2 and the latest, 4.9.65-3+deb9u2.  So basically the kernel never recovered properly after being paused during a live migration.

>> The attached file shows the "sar" output around the time the issue
>> started.  This has happened on 2 separate machines (started at different
>> times on each), and a reboot appears to (temporarily) fix the issue.
>> I'm wondering if anyone else has this issue, and if it could be
>> something to do with the hypervisor.
>
> Because of the mentioned steal time fix that was included in a version
> in between the 2 versions you mention, my first suggestion would be to
> see if the symptoms on the old and new kernel are exactly the same, or
> if they are only similar but different.
>
> Hans

I already rebooted the system running the 4.9.65 kernel, and beforehand, the symptoms were the same.  The CPU usage stats went back to normal after the reboot.
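
For comparing the symptoms between kernels without going through sar or top, the same percentages can be computed from two samples of the aggregate "cpu" line in /proc/stat. The following is only a minimal sketch under a few assumptions: the function names are made up for illustration, the 5 second interval is arbitrary, and the field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) is the one documented in proc(5). It is not how sar or top compute their values internally.

```python
import time

# Field order of the "cpu" lines in /proc/stat, as documented in proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:])))


def cpu_percentages(before, after):
    """Percentage of time per state between two samples of the cpu line."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    # guest and guest_nice are already counted in user and nice, so only
    # the first eight fields are summed.
    keys = FIELDS[:8]
    deltas = {k: b.get(k, 0) - a.get(k, 0) for k in keys}
    total = sum(deltas.values())
    if total <= 0:
        # No ticks elapsed, or the counters went backwards.
        return {k: 0.0 for k in keys}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """The aggregate cpu line is the first line of /proc/stat."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # arbitrary sampling interval
    second = read_cpu_line()
    for name, pct in cpu_percentages(first, second).items():
        print(f"{name:8s} {pct:6.2f}%")
```

Running this on both the old and the new kernel while a cpu-intensive job is active should make it easier to tell whether the steal, idle and iowait columns behave identically in both cases.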

--
Ryan Thoryk
r...@thoryk.com
r...@tliquest.net
