Taylor R Campbell <riastr...@netbsd.org> writes:

>> Date: Sun, 30 Jul 2023 14:56:53 -0400
>> From: Brad Spencer <b...@anduin.eldar.org>
>>
>> Taylor R Campbell <riastr...@netbsd.org> writes:
>>
>> > Can you please try running with the attached patch and share the
>> > warnings it produces?  Should give slightly more information.
>>
>> Caught another one.  As far as I know the system is up to date with all
>> of the requested patches:
>>
>> [ 19419.647972] WARNING: lwp 16 (system idle/1) flags 0xa0000020:
>> timecounter went backwards from (19420 + 0x9e37cf0149d8f7bb/2^64) sec at
>> netbsd:mi_switch+0x11e on cpu1 to (19419 + 0xad917b77bd0a7cd3/2^64) sec at
>> netbsd:mi_switch+0x11e on cpu1
>
> Can you run this dtrace script for a while (say, for a day, or from
> start of boot until you see the WARNING above which only happens once
> per boot), and then hit ^C?
>
> dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) }
>     sdt:xen:hardclock:jump /arg2 >= 430/ { printf("hardclock jump violated
>     timecounter contract") }'
>
> If my hypothesis is correct, you can just leave this running over any
> particular workload and you'll get:
>
> (a) a message printed whenever the hardclock delay is too long, and
> (b) when you hit ^C at the end, a histogram of all the >1-tick
>     hardclock jump delays.
>
> (Avoiding the tick-10s probe, like I used in the last dtrace
> suggestion, means you won't get updates printed every 10sec to your
> terminal -- you'll have to hit ^C to see the results -- but as an
> upside it won't instantly crash your kernel owing to the Xen/!Xen
> module ABI mismatch for CLKF_USERMODE/PC.)
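As an aside, the two readings in the quoted WARNING are printed as whole
seconds plus a 64-bit binary fraction ("sec + frac/2^64"), so they can be
decoded directly; the step backwards works out to roughly 0.94 sec.  A
quick Python sketch, using the values copied from the message above:

```python
# Decode the two "(sec + 0x.../2^64)" timecounter readings quoted in the
# WARNING, to see how far the clock stepped backwards.  Illustrative
# sketch only; the hex fractions are copied verbatim from the message.

def bintime(sec, frac_hex):
    """Convert (whole seconds, 64-bit binary fraction) to float seconds."""
    return sec + int(frac_hex, 16) / 2**64

before = bintime(19420, "9e37cf0149d8f7bb")  # earlier reading at mi_switch
after  = bintime(19419, "ad917b77bd0a7cd3")  # later reading, yet smaller

print(f"timecounter stepped back by about {before - after:.4f} sec")
# -> timecounter stepped back by about 0.9400 sec
```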
Struggling a little with this one...  The usual workload I have been
running is a couple of world builds at a time, sometimes followed by a
cvs update of -current and NetBSD 10.x and a rebuild of cscope indexes.
I suspect, but can't confirm, that one of the recent patches (probably
the one to call hardclock more often, if I remember correctly) is
causing some trouble with stability.  The system starts to act pretty
badly, mostly when accessing disk, after the world builds are done.
Often the cvs update will not complete at all.

I can start the above dtrace, but it will often exit with the following:

# dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) }
    sdt:xen:hardclock:jump /arg2 >= 430/ { printf("hardclock jump violated
    timecounter contract") }'
dtrace: description 'sdt:xen:hardclock:jump ' matched 2 probes
dtrace: processing aborted: Abort due to systemic unresponsiveness

The system is fine just after a reboot; it certainly seems to be a
requirement that a fair bit of work be done before it gets into a bad
state.  Sometimes, even if the dtrace does continue to run, it is
impossible to exit with CTRL-C.  The process appears stuck in this
state:

[ 4261.7158728] load: 2.64  cmd: dtrace 3295 [xclocv] 0.01u 0.02s 0% 7340k

At that point the guest must be destroyed and rebooted, as more and more
processes will start to get stuck.  If I don't put too heavy a load on
the system (or don't wait too long after loading it up), I can interrupt
the dtrace and it does produce a graph.
Here is one after about 2.5 hours of world building:

# dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) }
    sdt:xen:hardclock:jump /arg2 >= 430/ { printf("hardclock jump violated
    timecounter contract") }'
dtrace: description 'sdt:xen:hardclock:jump ' matched 2 probes
^C

           value  ------------- Distribution ------------- count
         8388608 |                                         0
        16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 220301
        33554432 |                                         0

I saw the WARNING about negative runtime just once so far again, but the
dtrace had exited with the "systemic unresponsiveness" message and I
could not interrupt it.

-- 
Brad Spencer - b...@anduin.eldar.org - KC8VKS - http://anduin.eldar.org