On Tue, Jan 03, 2012 at 01:42:38PM +, Ian Campbell wrote:
On Wed, 2011-12-28 at 01:49 +0100, Josip Rodin wrote:
This clock jump by 2999 seconds also happened here, so per:
http://old-list-archives.xen.org/archives/html/xen-devel/2011-02/msg01557.html
we switched to clocksource=pit in /etc/default/grub's $GRUB_CMDLINE_XEN on
the dom0. This seemed to have avoided the problem, but since then, the clock
jumps started happening like this:
Dec 21 19:42:23 dom0machine kernel: [6034768.658836] Clocksource tsc unstable (delta = -811538856601 ns)
In addition, now I checked what the said machine thinks is its clocksource:
% cat /sys/devices/system/clocksource/clocksource0/current_clocksource /sys/devices/system/clocksource/clocksource0/available_clocksource
xen
xen
So there's neither pit nor tsc in the available list :)
A PV kernel will (or should) always use xen as its clocksource. This
is a PV timesource based around the TSC + correction factors (to account
for drift and PCPU migration).
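To illustrate the "TSC + correction factors" idea, here is a minimal sketch of how a PV guest could turn a raw TSC reading into nanoseconds. The parameter names loosely follow the vcpu_time_info fields in Xen's public headers (tsc_timestamp, system_time, tsc_to_system_mul, tsc_shift). The 2.4 GHz rate and the sample values are assumptions for illustration only, not readings from any machine in this thread.

```python
def scale_delta(delta_ticks, mul_frac, shift):
    """Scale a TSC tick delta to nanoseconds using a 32.32 fixed-point multiplier."""
    if shift < 0:
        delta_ticks >>= -shift
    else:
        delta_ticks <<= shift
    return (delta_ticks * mul_frac) >> 32


def pv_time_ns(tsc_now, tsc_timestamp, system_time_ns, mul_frac, shift):
    """System time = last hypervisor snapshot + scaled TSC ticks since that snapshot."""
    return system_time_ns + scale_delta(tsc_now - tsc_timestamp, mul_frac, shift)


# Assumed example: a 2.4 GHz TSC with shift 0 (hypothetical values).
TSC_HZ = 2_400_000_000
MUL_FRAC = (10**9 << 32) // TSC_HZ  # nanoseconds per tick in 32.32 fixed point

# One second's worth of ticks after a snapshot taken at tsc=0, system_time=0.
elapsed = pv_time_ns(TSC_HZ, 0, 0, MUL_FRAC, 0)
print(f"elapsed: {elapsed} ns")
```

The hypervisor refreshes the snapshot and the scaling factors per VCPU, which is how drift and migration between physical CPUs get accounted for.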
The clocksource=pit on the hypervisor command line controls the
hypervisor's own timesource and not the dom0 kernel's. I'm not sure how
you query the hypervisor for its timesource but I guess it'll be in xl
dmesg somewhere (Platform timer is ...).
Ah, d'oh :) sorry, I wasn't really thinking.
The xm dmesg output on the HP DL360 machines that we have set to
clocksource=pit, and that have nevertheless had their time shift by more than
35996 seconds in at least five incidents in the last six months, says:
(XEN) Platform timer is 1.193MHz PIT
On a couple of FS RX300's that happened not to have clocksource=pit set but
had time shift by 2999.69 seconds it's this:
(XEN) Platform timer is 14.318MHz HPET
Both also show the following message after the time shift:
(XEN) Platform timer appears to have unexpectedly wrapped 10 or more times.
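As a rough cross-check of the numbers above, the two jump sizes can be compared against ten wraps of a platform timer counter. This is only a sketch: it assumes a 32-bit counter for both timers and the nominal frequencies 14.31818 MHz (HPET) and 1.193182 MHz (PIT), which are more precise than the rounded values in the boot messages and are not confirmed for these particular machines.

```python
COUNTER_BITS = 32  # assumption: the platform counter is treated as 32 bits wide

# Assumed nominal frequencies; the boot messages only show rounded values.
HPET_HZ = 14_318_180
PIT_HZ = 1_193_182


def wrap_period_seconds(freq_hz, bits=COUNTER_BITS):
    """Seconds until a counter of the given width overflows at freq_hz."""
    return (1 << bits) / freq_hz


for name, hz in (("HPET", HPET_HZ), ("PIT", PIT_HZ)):
    period = wrap_period_seconds(hz)
    print(f"{name}: one wrap = {period:.2f} s, ten wraps = {10 * period:.2f} s")
```

Under those assumptions ten wraps come out at roughly 2999.7 seconds for the HPET and roughly 35996 seconds for the PIT, which is close to the shifts seen on the RX300 and DL360 machines respectively. That is suggestive, not proof of the cause.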
The message you quote above says *tsc* unstable. Prior to that was the
system actually using the tsc clocksource? It really shouldn't have
been... Before that message did available_clocksource contain TSC? What
about current_clocksource? (Before here ~= on a freshly booted system)
The dom0 machines where we set clocksource=pit do see the sole xen
clocksource. That didn't stop the time from going awry.
On the dom0 machines that don't have the hypervisor fixed to
clocksource=pit:
* one dom0 that sees both xen and tsc in available_clocksource, but uses
xen as current_clocksource. Not sure what it used at the time of the
failure in September, probably the same because we didn't touch that.
* one that recently failed has:
% dmesg | grep unstable
[4613030.883101] Clocksource tsc unstable (delta = -2999660301416 ns)
% cat /sys/devices/system/clocksource/clocksource0/*
xen
xen
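For comparing incidents across machines it can help to pull the delta out of the kernel log and convert it to seconds. A minimal sketch, assuming the log line has exactly the shape shown above; the helper name parse_unstable is made up for this example.

```python
import re

# Matches e.g. "Clocksource tsc unstable (delta = -2999660301416 ns)"
UNSTABLE_RE = re.compile(r"Clocksource (\w+) unstable \(delta = (-?\d+) ns\)")


def parse_unstable(line):
    """Return (clocksource_name, delta_in_seconds), or None if the line does not match."""
    match = UNSTABLE_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)) / 1e9


line = "[4613030.883101] Clocksource tsc unstable (delta = -2999660301416 ns)"
name, delta_s = parse_unstable(line)
print(f"{name}: {delta_s:.2f} s")
```

The line must be on a single row for the pattern to match, so wrapped mail quotes need to be rejoined first.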
What are your exact hypervisor and kernel command lines? Other than
clocksource=pit are you overriding anything else in this regard?
Most of the machines now seem to have:
GRUB_CMDLINE_LINUX=console=tty0 console=ttyS1,115200n1 elevator=deadline
GRUB_CMDLINE_XEN=dom0_mem=512M clocksource=pit cpuidle=0
The machines without clocksource=pit only had dom0_mem=512M for the
hypervisor and nothing for the dom0 kernel.
Can you press the 's' hypervisor debug key and report the resulting text
from dmesg? (Press a debug key == xl debug-key s + xl dmesg, or press
Ctrl-A 3 times on serial then press 's'.)
(Note that I used xm for both of those commands, I don't have xl.)
This is the output on a couple of the DL360's with clocksource=pit:
(XEN) TSC has constant rate, deep Cstates possible, so not reliable, warp=3066 (count=1)
(XEN) dom2: mode=0,ofs=0x21e231c896,khz=2333479,inc=1,vtsc count: 10647611967 kernel, 454486411 user
(XEN) dom12: mode=0,ofs=0x21a01e68ddeb,khz=2333479,inc=1,vtsc count: 2478607037 kernel, 199833427 user
(XEN) dom17: mode=0,ofs=0x8d12c3820bf0b,khz=2333479,inc=1,vtsc count: 918220049 kernel, 56818086 user
(XEN) dom18: mode=0,ofs=0x8d1334e2f635f,khz=2333479,inc=1,vtsc count: 4707785417 kernel, 197043637 user
(XEN) dom21: mode=0,ofs=0x1004cc1e5bf801,khz=2333479,inc=1,vtsc count: 6386763431 kernel, 166512523 user
(XEN) dom22: mode=0,ofs=0x14b5955232a7e1,khz=2333479,inc=1,vtsc count: 2218555643 kernel, 88962103 user
(XEN) TSC has constant rate, deep Cstates possible, so not reliable, warp=1715 (count=1)
(XEN) dom1: mode=0,ofs=0x149170bd5f,khz=2333479,inc=1,vtsc count: 36234921552 kernel, 294922844 user
This is the output on an RX300 without clocksource=pit:
(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0x59e046806,khz=2400116,inc=1
(XEN) No domains have emulated TSC
And finally this is the output on the odd machine that has tsc as an
available clock source:
(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0x593b1f9e8,khz=2400190,inc=1
(XEN) dom4: mode=0,ofs=0xf3c77d49e41e6,khz=2400190,inc=1
(XEN) No domains have emulated TSC
In the latter case, I've no idea why the domU with the ID 4