Bug#599161: ditto

2012-01-04 Thread Josip Rodin
On Tue, Jan 03, 2012 at 01:42:38PM +, Ian Campbell wrote:
 On Wed, 2011-12-28 at 01:49 +0100, Josip Rodin wrote:
  This clock jump by 2999 seconds also happened here, so per:
  
  http://old-list-archives.xen.org/archives/html/xen-devel/2011-02/msg01557.html
  
  we switched to clocksource=pit in /etc/default/grub's $GRUB_CMDLINE_XEN on
  the dom0. This seemed to have avoided the problem, but since then, the clock
  jumps started happening like this:
  
  Dec 21 19:42:23 dom0machine kernel: [6034768.658836] Clocksource tsc unstable (delta = -811538856601 ns)
  
  In addition, now I checked what the said machine thinks is its clocksource:
  
  % cat /sys/devices/system/clocksource/clocksource0/current_clocksource /sys/devices/system/clocksource/clocksource0/available_clocksource
  xen
  xen
  
  So there's neither pit nor tsc in the available list :)
 
 A PV kernel will (or should) always use xen as its clocksource. This
 is a PV timesource based around the TSC + correction factors (to account
 for drift and PCPU migration).
 
 The clocksource=pit on the hypervisor command line controls the
 hypervisor's own timesource and not the dom0 kernel's. I'm not sure how
 you query the hypervisor for its timesource but I guess it'll be in xl
 dmesg somewhere (Platform timer is ...).

Ah, d'oh :) sorry, I wasn't really thinking.

The xm dmesg output on the HP DL360 machines that we have set to
clocksource=pit, and that have nevertheless had their time shift by more
than 35996 seconds in at least five incidents in the last six months, says:

(XEN) Platform timer is 1.193MHz PIT

On a couple of FS RX300's that happened not to have clocksource=pit set but
had time shift by 2999.69 seconds it's this:

(XEN) Platform timer is 14.318MHz HPET

Both also show the following message after the time shift:

(XEN) Platform timer appears to have unexpectedly wrapped 10 or more times.
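
To compare many machines quickly, the two hypervisor messages quoted above can be pulled out of saved `xm dmesg` (or `xl dmesg`) text. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the message wording is exactly as shown in this mail.

```python
import re

# Matches lines such as "(XEN) Platform timer is 14.318MHz HPET".
# The pattern is an assumption based only on the two lines quoted above.
PLATFORM_TIMER_RE = re.compile(
    r"\(XEN\)\s+Platform timer is\s+(?P<freq>[\d.]+)MHz\s+(?P<name>\S+)"
)


def platform_timer(dmesg_text):
    """Return (timer_name, frequency_mhz) for the first match, or None."""
    match = PLATFORM_TIMER_RE.search(dmesg_text)
    if match is None:
        return None
    return match.group("name"), float(match.group("freq"))


def wrap_warnings(dmesg_text):
    """Return every line that reports an unexpected platform timer wrap."""
    return [
        line
        for line in dmesg_text.splitlines()
        if "Platform timer appears to have unexpectedly wrapped" in line
    ]


if __name__ == "__main__":
    sample = (
        "(XEN) Platform timer is 1.193MHz PIT\n"
        "(XEN) Platform timer appears to have unexpectedly wrapped 10 or more times.\n"
    )
    print(platform_timer(sample))
    print(len(wrap_warnings(sample)))
```

The text would come from something like `xm dmesg > xen-dmesg.txt` on each dom0; the sketch deliberately does not run the tool itself.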


 The message you quote above says *tsc* unstable. Prior to that was the
 system actually using the tsc clocksource? It really shouldn't have
 been... Before that message did available_clocksource contain TSC? What
 about current_clocksource? (Before here ~= on a freshly booted system)

The dom0 machines where we set clocksource=pit do see the sole xen
clocksource. That didn't stop the time from going awry.

On the dom0 machines that don't have the hypervisor fixated on
clocksource=pit:

* one dom0 that sees both xen and tsc in available_clocksource, but uses
  xen as current_clocksource. Not sure what it used at the time of the
  failure in September, probably the same because we didn't touch that. 
* one that recently failed has:

% dmesg | grep unstable
[4613030.883101] Clocksource tsc unstable (delta = -2999660301416 ns)
% cat /sys/devices/system/clocksource/clocksource0/*
xen
xen
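
For reference, the delta in that kernel message is in nanoseconds, so -2999660301416 ns is about -2999.66 seconds, which lines up with the roughly 2999-second jump. A small sketch of the conversion (the helper name is made up, and the regular expression assumes the message format shown above):

```python
import re

# Format taken from the dmesg lines quoted in this thread.
DELTA_RE = re.compile(r"Clocksource (\w+) unstable \(delta = (-?\d+) ns\)")


def clocksource_delta_seconds(line):
    """Return (clocksource_name, delta_in_seconds), or None if no match."""
    match = DELTA_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)) / 1e9


if __name__ == "__main__":
    for line in (
        "[4613030.883101] Clocksource tsc unstable (delta = -2999660301416 ns)",
        "[6034768.658836] Clocksource tsc unstable (delta = -811538856601 ns)",
    ):
        name, seconds = clocksource_delta_seconds(line)
        print(f"{name}: {seconds:.2f} s")
```

The second sample line is the one from the machine with clocksource=pit; its delta works out to about -811.54 seconds.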

 What are your exact hypervisor and kernel command lines? Other than
 clocksource=pit are you overriding anything else in this regard?

Most of the machines now seem to have:

GRUB_CMDLINE_LINUX=console=tty0 console=ttyS1,115200n1 elevator=deadline
GRUB_CMDLINE_XEN=dom0_mem=512M clocksource=pit cpuidle=0

The machines without clocksource=pit only had dom0_mem=512M for the
hypervisor and nothing for the dom0 kernel.

 Can you press the 's' hypervisor debug key and report the resulting text
 from dmesg. (press a debug key == xl debug-key s + xl dmesg or press
 Ctrl-A 3 times on serial then press 's').

(Note that I used xm for both of those commands, I don't have xl.)

This is the output on a couple of the DL360's with clocksource=pit:

(XEN) TSC has constant rate, deep Cstates possible, so not reliable, warp=3066 (count=1)
(XEN) dom2: mode=0,ofs=0x21e231c896,khz=2333479,inc=1,vtsc count: 10647611967 kernel, 454486411 user
(XEN) dom12: mode=0,ofs=0x21a01e68ddeb,khz=2333479,inc=1,vtsc count: 2478607037 kernel, 199833427 user
(XEN) dom17: mode=0,ofs=0x8d12c3820bf0b,khz=2333479,inc=1,vtsc count: 918220049 kernel, 56818086 user
(XEN) dom18: mode=0,ofs=0x8d1334e2f635f,khz=2333479,inc=1,vtsc count: 4707785417 kernel, 197043637 user
(XEN) dom21: mode=0,ofs=0x1004cc1e5bf801,khz=2333479,inc=1,vtsc count: 6386763431 kernel, 166512523 user
(XEN) dom22: mode=0,ofs=0x14b5955232a7e1,khz=2333479,inc=1,vtsc count: 2218555643 kernel, 88962103 user

(XEN) TSC has constant rate, deep Cstates possible, so not reliable, warp=1715 (count=1)
(XEN) dom1: mode=0,ofs=0x149170bd5f,khz=2333479,inc=1,vtsc count: 36234921552 kernel, 294922844 user
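
If it helps with comparing hosts, the per-domain lines from the 's' debug key can be turned into records. This is just a sketch with a made-up function name; it assumes the `domN: mode=...,ofs=...,khz=...,inc=...` layout seen in the output above and does not try to interpret the fields. It only reads up to `inc=`, so it does not matter whether the trailing "vtsc count" part is wrapped onto the next line.

```python
import re

# Field layout assumed from the debug-key output quoted in this mail.
DOM_RE = re.compile(
    r"\(XEN\)\s+dom(?P<domid>\d+):\s+mode=(?P<mode>\d+),"
    r"ofs=(?P<ofs>0x[0-9a-fA-F]+),khz=(?P<khz>\d+),inc=(?P<inc>\d+)"
)


def parse_tsc_domains(text):
    """Return one dict per 'domN:' line found in the debug-key output."""
    domains = []
    for match in DOM_RE.finditer(text):
        domains.append(
            {
                "domid": int(match.group("domid")),
                "mode": int(match.group("mode")),
                "ofs": int(match.group("ofs"), 16),  # hex string to integer
                "khz": int(match.group("khz")),
                "inc": int(match.group("inc")),
            }
        )
    return domains


if __name__ == "__main__":
    sample = (
        "(XEN) dom1: mode=0,ofs=0x149170bd5f,khz=2333479,inc=1,"
        "vtsc count: 36234921552 kernel, 294922844 user\n"
    )
    for dom in parse_tsc_domains(sample):
        print(dom)
```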

This is the output on an RX300 without clocksource=pit:

(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0x59e046806,khz=2400116,inc=1
(XEN) No domains have emulated TSC

And finally this is the output on the odd machine that has tsc as an
available clock source:

(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0x593b1f9e8,khz=2400190,inc=1
(XEN) dom4: mode=0,ofs=0xf3c77d49e41e6,khz=2400190,inc=1
(XEN) No domains have emulated TSC

In the latter case, I've no idea why the domU with the ID 4 

Bug#599161: ditto

2012-01-03 Thread Ian Campbell
On Wed, 2011-12-28 at 01:49 +0100, Josip Rodin wrote:
 This clock jump by 2999 seconds also happened here, so per:
 
 http://old-list-archives.xen.org/archives/html/xen-devel/2011-02/msg01557.html
 
 we switched to clocksource=pit in /etc/default/grub's $GRUB_CMDLINE_XEN on
 the dom0. This seemed to have avoided the problem, but since then, the clock
 jumps started happening like this:
 
 Dec 21 19:42:23 dom0machine kernel: [6034768.658836] Clocksource tsc unstable (delta = -811538856601 ns)
 
 In addition, now I checked what the said machine thinks is its clocksource:
 
 % cat /sys/devices/system/clocksource/clocksource0/current_clocksource /sys/devices/system/clocksource/clocksource0/available_clocksource
 xen
 xen
 
 So there's neither pit nor tsc in the available list :)

A PV kernel will (or should) always use xen as its clocksource. This
is a PV timesource based around the TSC + correction factors (to account
for drift and PCPU migration).

The clocksource=pit on the hypervisor command line controls the
hypervisor's own timesource and not the dom0 kernel's. I'm not sure how
you query the hypervisor for its timesource but I guess it'll be in xl
dmesg somewhere (Platform timer is ...).

The message you quote above says *tsc* unstable. Prior to that was the
system actually using the tsc clocksource? It really shouldn't have
been... Before that message did available_clocksource contain TSC? What
about current_clocksource? (Before here ~= on a freshly booted system)

What are your exact hypervisor and kernel command lines? Other than
clocksource=pit are you overriding anything else in this regard?

Can you press the 's' hypervisor debug key and report the resulting text
from dmesg. (press a debug key == xl debug-key s + xl dmesg or press
Ctrl-A 3 times on serial then press 's').

It seems odd that the only reports we see of this issue are with Debian
Squeeze. It's possible that the snapshot of pvops which made it into
squeeze had some issue but I've just looked over the diff between that
and the current xen 2.6.32 pvops kernel and don't see anything obviously
time related. Perhaps this is a bug in Xen 4.0.x rather than the kernel?

If someone who can reproduce could try (separately) a new kernel and new
hypervisor that might help narrow it down.

Another option instead of clocksource= might be to try
tsc=[unstable|skewed]. Quoth the comment:
/*
 * tsc=unstable: Override all tests; assume TSC is unreliable.
 * tsc=skewed: Assume TSCs are individually reliable, but skewed across CPUs.
 */

Ian.
-- 
Ian Campbell
Current Noise: Today Is The Day - Pain Is A Warning

A good marriage would be between a blind wife and deaf husband.
-- Michel de Montaigne




-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: 
http://lists.debian.org/1325598164.25206.136.ca...@zakaz.uk.xensource.com



Bug#599161: ditto

2011-12-27 Thread Josip Rodin

This clock jump by 2999 seconds also happened here, so per:

http://old-list-archives.xen.org/archives/html/xen-devel/2011-02/msg01557.html

we switched to clocksource=pit in /etc/default/grub's $GRUB_CMDLINE_XEN on
the dom0. This seemed to have avoided the problem, but since then, the clock
jumps started happening like this:

Dec 21 19:42:23 dom0machine kernel: [6034768.658836] Clocksource tsc unstable (delta = -811538856601 ns)

In addition, now I checked what the said machine thinks is its clocksource:

% cat /sys/devices/system/clocksource/clocksource0/current_clocksource /sys/devices/system/clocksource/clocksource0/available_clocksource
xen
xen

So there's neither pit nor tsc in the available list :)
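
The same check as the cat above can be scripted for running across several dom0s. This is a rough sketch only: the function name is made up, and the `base` parameter exists purely so the function can be pointed at a copy of the sysfs files.

```python
from pathlib import Path

# sysfs directory used in the cat command above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current_clocksource, [available_clocksources])."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    # available_clocksource is a whitespace-separated list on one line.
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    current, available = read_clocksources()
    print("current:  ", current)
    print("available:", ", ".join(available))
```

On the machine described here this would give "xen" as the current clocksource and a single-entry available list.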

-- 
 2. That which causes joy or happiness.

Archive: http://lists.debian.org/20111228004915.ga21...@entuzijast.net