On Tue, Jan 03, 2012 at 01:42:38PM +0000, Ian Campbell wrote:
> On Wed, 2011-12-28 at 01:49 +0100, Josip Rodin wrote:
> > This clock jump by 2999 seconds also happened here, so per:
> >
> > http://old-list-archives.xen.org/archives/html/xen-devel/2011-02/msg01557.html
> >
> > we switched to clocksource=pit in /etc/default/grub's $GRUB_CMDLINE_XEN on
> > the dom0. This seemed to have avoided the problem, but since then, the clock
> > jumps started happening like this:
> >
> > Dec 21 19:42:23 dom0machine kernel: [6034768.658836] Clocksource tsc
> > unstable (delta = -811538856601 ns)
> >
> > In addition, now I checked what the said machine thinks is its clocksource:
> >
> > % cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> > /sys/devices/system/clocksource/clocksource0/available_clocksource
> > xen
> > xen
> >
> > So there's neither pit nor tsc in the available list :)
>
> A PV kernel will (or should) always use "xen" as it's clocksource. This
> is a PV timesource based around the TSC + correction factors (to account
> for drift and PCPU migration).
>
> The clocksource=pit on the hypervisor command line controls the
> hypervisor's own timesource and not the dom0 kernels. I'm not sure how
> you query the hypervisor for its timesource but I guess it'll be in "xl
> dmesg" somewhere ("Platform timer is ...").
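For anyone who wants to pull these values out of saved logs rather than eyeball them, here is a minimal Python sketch. It assumes the two line formats are exactly as quoted in this thread: "Platform timer is ..." from the hypervisor log and "Clocksource <name> unstable (delta = <n> ns)" from the kernel log. The function names `platform_timers` and `clock_jumps` are made up for illustration and are not part of any Xen or kernel tool.

```python
import re

# Both patterns are assumptions based only on the log lines quoted in this thread.
PLATFORM_TIMER_RE = re.compile(r"Platform timer is\s+(.+?)\s*$")
UNSTABLE_RE = re.compile(r"Clocksource (\S+) unstable \(delta = (-?\d+) ns\)")


def platform_timers(dmesg_text):
    """Return the text following 'Platform timer is' for each matching line."""
    found = []
    for line in dmesg_text.splitlines():
        match = PLATFORM_TIMER_RE.search(line)
        if match:
            found.append(match.group(1))
    return found


def clock_jumps(kernel_log_text):
    """Return (clocksource name, delta in seconds) for each 'unstable' line."""
    jumps = []
    for line in kernel_log_text.splitlines():
        match = UNSTABLE_RE.search(line)
        if match:
            # The kernel reports the delta in nanoseconds.
            jumps.append((match.group(1), int(match.group(2)) / 1e9))
    return jumps


if __name__ == "__main__":
    hypervisor_log = (
        "(XEN) Platform timer is 1.193MHz PIT\n"
        "(XEN) Platform timer appears to have unexpectedly wrapped 10 or more times.\n"
    )
    kernel_log = "[4613030.883101] Clocksource tsc unstable (delta = -2999660301416 ns)\n"
    print(platform_timers(hypervisor_log))
    print(clock_jumps(kernel_log))
```

Converting the delta to seconds makes it easier to compare a kernel-side report with the size of the jump seen on the wall clock. A kernel line that has been wrapped across two lines, as in the quoted text above, would need to be rejoined before this sketch can match it.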

Ah, d'oh :) sorry, I wasn't really thinking.

The xm dmesg output on the HP DL360 machines that we have set to
clocksource=pit, and whose clocks have nevertheless shifted by more than
35996 seconds in at least five incidents in the last six months, says:

(XEN) Platform timer is 1.193MHz PIT

On a couple of FS RX300's that happened not to have clocksource=pit set but
had their time shift by 2999.69 seconds, it's this:

(XEN) Platform timer is 14.318MHz HPET

Both also show the following message after the time shift:

(XEN) Platform timer appears to have unexpectedly wrapped 10 or more times.

> The message you quote above says *tsc* unstable. Prior to that was the
> system actually using the tsc clocksource? It really shouldn't have
> been... Before that message did available_clocksource contain TSC? What
> about current_clocksource? ("Before" here ~= on a freshly booted system)

The dom0 machines where we set clocksource=pit see only the "xen"
clocksource. That didn't stop the time from going awry.

On the dom0 machines that don't have the hypervisor fixated on
clocksource=pit:

* one dom0 sees both "xen" and "tsc" in available_clocksource, but uses
  "xen" as current_clocksource. Not sure what it used at the time of the
  failure in September, probably the same because we didn't touch that.

* one that recently failed has:

  % dmesg | grep unstable
  [4613030.883101] Clocksource tsc unstable (delta = -2999660301416 ns)
  % cat /sys/devices/system/clocksource/clocksource0/*
  xen
  xen

> What are your exact hypervisor and kernel command lines? Other than
> clocksource=pit are you overriding anything else in this regard?

Most of the machines now seem to have:

GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n1 elevator=deadline"
GRUB_CMDLINE_XEN="dom0_mem=512M clocksource=pit cpuidle=0"

The machines without clocksource=pit only had dom0_mem=512M for the
hypervisor and nothing for the dom0 kernel.

> Can you press the 's' hypervisor debug key and report the resulting text
> from dmesg.
> (press a debug key == "xl debug-key s" + "xl dmesg" or press
> Ctrl-A 3 times on serial then press 's').

(Note that I used xm for both of those commands, I don't have xl.)

This is the output on a couple of the DL360's with clocksource=pit:

(XEN) TSC has constant rate, deep Cstates possible, so not reliable, warp=3066 (count=1)
(XEN) dom2: mode=0,ofs=0x21e231c896,khz=2333479,inc=1,vtsc count: 10647611967 kernel, 454486411 user
(XEN) dom12: mode=0,ofs=0x21a01e68ddeb,khz=2333479,inc=1,vtsc count: 2478607037 kernel, 199833427 user
(XEN) dom17: mode=0,ofs=0x8d12c3820bf0b,khz=2333479,inc=1,vtsc count: 918220049 kernel, 56818086 user
(XEN) dom18: mode=0,ofs=0x8d1334e2f635f,khz=2333479,inc=1,vtsc count: 4707785417 kernel, 197043637 user
(XEN) dom21: mode=0,ofs=0x1004cc1e5bf801,khz=2333479,inc=1,vtsc count: 6386763431 kernel, 166512523 user
(XEN) dom22: mode=0,ofs=0x14b5955232a7e1,khz=2333479,inc=1,vtsc count: 2218555643 kernel, 88962103 user

(XEN) TSC has constant rate, deep Cstates possible, so not reliable, warp=1715 (count=1)
(XEN) dom1: mode=0,ofs=0x149170bd5f,khz=2333479,inc=1,vtsc count: 36234921552 kernel, 294922844 user

This is the output on an RX300 without clocksource=pit:

(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0x59e046806,khz=2400116,inc=1
(XEN) No domains have emulated TSC

And finally this is the output on the odd machine that has tsc as an
available clock source:

(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0x593b1f9e8,khz=2400190,inc=1
(XEN) dom4: mode=0,ofs=0xf3c77d49e41e6,khz=2400190,inc=1
(XEN) No domains have emulated TSC

In the latter case, I've no idea why the domU with the ID 4 would be using
a different clock source - we certainly didn't set it up in any such special
manner, it's been generated and booted like all others. Within this domU
machine, there's:

% cat /sys/devices/system/clocksource/clocksource0/*
xen tsc
xen

So it looks like we consistently use the xen clocksource.
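
To compare the 's' debug-key output across many hosts, the TSC summary line can be reduced to three fields. The following is a rough Python sketch that assumes the line looks like one of the two variants pasted above ("... so not reliable, warp=N (count=M)" or "marked as reliable, warp = N (count=M)"). The name `parse_tsc_status` is hypothetical, and other Xen versions may word this line differently.

```python
import re

# Pattern assumed from the two "(XEN) TSC ..." variants shown in this message.
TSC_LINE_RE = re.compile(
    r"TSC (?P<status>.+?), warp\s*=\s*(?P<warp>\d+)\s*\(count=(?P<count>\d+)\)"
)


def parse_tsc_status(line):
    """Return reliability, warp and count from a TSC summary line, or None."""
    match = TSC_LINE_RE.search(line)
    if match is None:
        return None
    status = match.group("status")
    return {
        # "so not reliable" must not be mistaken for "marked as reliable".
        "reliable": "reliable" in status and "not reliable" not in status,
        "warp": int(match.group("warp")),
        "count": int(match.group("count")),
    }


if __name__ == "__main__":
    lines = [
        "(XEN) TSC has constant rate, deep Cstates possible, so not reliable, warp=3066 (count=1)",
        "(XEN) TSC marked as reliable, warp = 0 (count=2)",
        "(XEN) No domains have emulated TSC",
    ]
    for line in lines:
        print(parse_tsc_status(line))
```

Lines that do not carry a warp value, such as "No domains have emulated TSC", are simply skipped by returning None.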

> Another option instead of clocksource= might be to try
> tsc=[unstable|skewed]. Quoth the comment:
> /*
>  * tsc=unstable: Override all tests; assume TSC is unreliable.
>  * tsc=skewed: Assume TSCs are individually reliable, but skewed across CPUs.
>  */

This is also for the hypervisor, right?

In any case, I don't quite see what tsc=unstable would bring us - we see
problems both in cases where TSC is marked as reliable and where it is
marked as unreliable, it's just a different shift value :)

-- 
     2. That which causes joy or happiness.