Package: src:linux
Version: 4.9.30-2+deb9u3
Severity: normal
Tags: patch

Dear Maintainer,

when Debian Stretch runs as a paravirtualized guest under Xen, the
kernel obtains its CPU steal time counter from the virtualization host.
On some hosts, the returned steal time occasionally decreases slightly.
This makes an unsigned variable in the kernel wrap around and corrupts
the subsequent steal time accounting (for example, counters that run
backwards). As a result, tools like "top" or "vmstat" can no longer
report a meaningful CPU utilization.
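
To illustrate the wraparound, here is a minimal userspace sketch. It is
not kernel code: the helper name steal_delta_unsigned and the sample
values are made up for illustration, and only the unsigned subtraction
mirrors what the unpatched code path does.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): computes the steal time
 * delta the way the unpatched code does, with unsigned 64-bit
 * arithmetic. If the host reports a value smaller than the previous
 * one, the result wraps modulo 2^64 into a huge positive number.
 */
static uint64_t steal_delta_unsigned(uint64_t now_ns, uint64_t prev_ns)
{
	return now_ns - prev_ns;
}
```

With these assumptions, steal_delta_unsigned(900, 1000) yields
2^64 - 100 rather than -100, which is the kind of jump that later shows
up in /proc/stat.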

While this is likely a bug in the virtualization environment, the kernel
running as a guest should deal with this gracefully. I attached a patch
to this report which fixes the errors caused by this on the guest.
Kernel versions 4.7 and older, as well as 4.11 and newer should not be
affected by this issue.
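
The idea behind the patch can be sketched in userspace as follows. This
is a simplified illustration under my own assumptions, not the patch
itself: struct steal_state and steal_delta_guarded are hypothetical
names, and the sketch leaves out the cputime conversion and the maxtime
clamp that the real function performs.

```c
#include <stdint.h>

/* Hypothetical stand-in for this_rq()->prev_steal_time. */
struct steal_state {
	uint64_t prev_steal_ns;
};

/*
 * Returns the number of nanoseconds to account as steal time.
 * A decreasing clock is ignored and the stored value is left
 * untouched, so nothing is accounted until the clock rises above
 * the value that was already seen.
 */
static uint64_t steal_delta_guarded(struct steal_state *st, uint64_t now_ns)
{
	/* Reinterpret the difference as signed, as the patch does with s64. */
	int64_t delta = (int64_t)(now_ns - st->prev_steal_ns);

	if (delta < 0)
		return 0;

	st->prev_steal_ns = now_ns;
	return (uint64_t)delta;
}
```

Starting from a stored value of 1000, readings of 900 and 950 both
return 0 and leave the stored value alone; a later reading of 1100
returns 100.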

Bug #785557 shows that behavior like this is caused by some broken KVM
hosts. I myself experience this on a Xen host which unfortunately I have
no more information about.

A more detailed description of the issue is part of the patch header,
as well as the following blog post:
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/

I would appreciate inclusion of this patch in Debian as this issue may
affect other people running on buggy virtualization hosts and the patch
should not influence other systems.

Note that the system I am reporting this from already runs a
custom-patched kernel, which may influence some of the information below.

-- Package-specific info:
** Version:
Linux version 4.9.0-3-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u3+lass1 (2017-08-08)

** Command line:
root=/dev/xvda ro 

** Not tainted

** Kernel log:
Unable to read kernel log; any relevant messages should be attached

** Model information

** Loaded modules:
ipt_REJECT
nf_reject_ipv4
binfmt_misc
xt_multiport
iptable_filter
intel_rapl
sb_edac
edac_core
evdev
kvm_intel
kvm
irqbypass
crct10dif_pclmul
crc32_pclmul
ghash_clmulni_intel
pcspkr
intel_rapl_perf
ip_tables
x_tables
autofs4
ext4
crc16
jbd2
fscrypto
ecb
mbcache
btrfs
crc32c_generic
xor
raid6_pq
crc32c_intel
xen_netfront
xen_blkfront
aesni_intel
aes_x86_64
glue_helper
lrw
gf128mul
ablk_helper
cryptd

** PCI devices:
not available

** USB devices:
not available


-- System Information:
Debian Release: 9.1
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.0-3-amd64 (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages linux-image-4.9.0-3-amd64 depends on:
ii  initramfs-tools [linux-initramfs-tool]  0.130
ii  kmod                                    23-2
ii  linux-base                              4.5

Versions of packages linux-image-4.9.0-3-amd64 recommends:
ii  firmware-linux-free  3.4
ii  irqbalance           1.1.0-2.3

Versions of packages linux-image-4.9.0-3-amd64 suggests:
pn  debian-kernel-handbook               <none>
pn  grub-pc | grub-efi-amd64 | extlinux  <none>
pn  linux-doc-4.9                        <none>

Versions of packages linux-image-4.9.0-3-amd64 is related to:
pn  firmware-amd-graphics     <none>
pn  firmware-atheros          <none>
pn  firmware-bnx2             <none>
pn  firmware-bnx2x            <none>
pn  firmware-brcm80211        <none>
pn  firmware-cavium           <none>
pn  firmware-intel-sound      <none>
pn  firmware-intelwimax       <none>
pn  firmware-ipw2x00          <none>
pn  firmware-ivtv             <none>
pn  firmware-iwlwifi          <none>
pn  firmware-libertas         <none>
pn  firmware-linux-nonfree    <none>
pn  firmware-misc-nonfree     <none>
pn  firmware-myricom          <none>
pn  firmware-netxen           <none>
pn  firmware-qlogic           <none>
pn  firmware-realtek          <none>
pn  firmware-samsung          <none>
pn  firmware-siano            <none>
pn  firmware-ti-connectivity  <none>
pn  xen-hypervisor            <none>

-- no debconf information
From 4b66621a06a94d22629661a9262f92b8cf5b7ca9 Mon Sep 17 00:00:00 2001
From: Michael Lass <be...@bi-co.net>
Date: Sun, 6 Aug 2017 18:09:21 +0200
Subject: [PATCH] sched/cputime: handle decreasing steal clock

On some flaky Xen hosts, the steal clock returned by paravirt_steal_clock is
not monotonically increasing but can slightly decrease. Currently this results
in an overflow of u64 steal. Before giving this number to account_steal_time()
it is converted into cputime, so the target cpustat counter
cpustat[CPUTIME_STEAL] is not overflowing as well but instead increased by a
large amount. Due to the conversion to cputime and back into nanoseconds,
this_rq()->prev_steal_time does not correctly reflect the latest reported steal
clock afterwards, resulting in erratic behavior such as backwards running
cpustat[CPUTIME_STEAL]. The following is a trace from userspace of the value for
steal time reported in /proc/stat:

time    stolen         diff
----    ------         ----
0ms     784
100ms   1844670130367  1844670129583
200ms   1844664564089  -5566278
300ms   1844659554439  -5009650
400ms   1844655101417  -4453022

This issue was probably introduced by the following commits, which deactivate a
check for (steal < 0) in the Xen pv guest codepath and allow unlimited jumps of
the cpustat counters (both introduced in v4.8):
ecb23dc6f2eff0ce64dd60351a81f376f13b12cc
03cbc732639ddcad15218c4b2046d255851ff1e3

As a workaround, ignore decreasing values of the steal clock. By not updating
this_rq()->prev_steal_time we make sure that steal time is only accounted for
once the steal clock rises above the value that was already observed and
accounted for earlier.

In current kernel versions (v4.11 and higher) this issue should not exist since
conversion between nsec and cputime has been eliminated. Therefore all values
will overflow, i.e. decrease as reported by the host system.
---
 kernel/sched/cputime.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5ebee3164e64..5f039f7f9294 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -262,10 +262,19 @@ static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 #ifdef CONFIG_PARAVIRT
        if (static_key_false(&paravirt_steal_enabled)) {
                cputime_t steal_cputime;
-               u64 steal;
-
-               steal = paravirt_steal_clock(smp_processor_id());
-               steal -= this_rq()->prev_steal_time;
+               u64 steal_time;
+               s64 steal;
+
+               steal_time = paravirt_steal_clock(smp_processor_id());
+               steal = steal_time - this_rq()->prev_steal_time;
+
+               if (unlikely(steal < 0)) {
+                       printk_ratelimited(KERN_DEBUG "cputime: steal_clock for "
+                               "processor %d decreased: %llu -> %llu, "
+                               "ignoring\n", smp_processor_id(),
+                               this_rq()->prev_steal_time, steal_time);
+                       return 0;
+               }
 
                steal_cputime = min(nsecs_to_cputime(steal), maxtime);
                account_steal_time(steal_cputime);
-- 
2.14.0
