Taylor R Campbell <riastr...@netbsd.org> writes:

>> Date: Sun, 30 Jul 2023 14:56:53 -0400
>> From: Brad Spencer <b...@anduin.eldar.org>
>> 
>> Taylor R Campbell <riastr...@netbsd.org> writes:
>> 
>> > Can you please try running with the attached patch and share the
>> > warnings it produces?  Should give slightly more information.
>> 
>> Caught another one.  As far as I know the system is up to date with all
>> of the requested patches:
>> 
>> [ 19419.647972] WARNING: lwp 16 (system idle/1) flags 0xa0000020: 
>> timecounter went backwards from (19420 + 0x9e37cf0149d8f7bb/2^64) sec at 
>> netbsd:mi_switch+0x11e on cpu1 to (19419 + 0xad917b77bd0a7cd3/2^64) sec at 
>> netbsd:mi_switch+0x11e on cpu1
>
> Can you run this dtrace script for a while (say, for a day, or from
> start of boot until you see the WARNING above which only happens once
> per boot), and then hit ^C?
>
> dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) } sdt:xen:hardclock:jump /arg2 >= 430/ { printf("hardclock jump violated timecounter contract") }'
>
> If my hypothesis is correct, you can just leave this running over any
> particular workload and you'll get:
>
> (a) a message printed whenever the hardclock delay is too long, and
> (b) when you hit ^C at the end, a histogram of all the >1-tick
>     hardclock jump delays.
>
> (Avoiding the tick-10s probe, like I used in the last dtrace
> suggestion, means you won't get updates printed every 10sec to your
> terminal -- you'll have to hit ^C to see the results -- but as an
> upside it won't instantly crash your kernel owing to the Xen/!Xen
> module ABI mismatch for CLKF_USERMODE/PC.)

Struggling a little with this one...

The usual workload I have been doing is a couple of world builds at a
time, sometimes followed by a cvs update of -current and NetBSD 10.x
and a rebuild of cscope indexes.  I suspect, but can't confirm, that
one of the recent patches (probably the one to call hardclock more
often, if I remember correctly) is causing some trouble with
stability.  The system starts to act pretty badly, mostly when
accessing the disk after the world builds are done.  Often the cvs
update will not complete at all.  I can start the above dtrace, but it
will often exit with the following:

# dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) } sdt:xen:hardclock:jump /arg2 >= 430/ { printf("hardclock jump violated timecounter contract") }'
dtrace: description 'sdt:xen:hardclock:jump ' matched 2 probes
dtrace: processing aborted: Abort due to systemic unresponsiveness
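
One thing I may try next (just a sketch on my part, assuming arg0,
arg1, and arg2 mean the same things as in your script) is to drop the
aggregation and print each violating event as it fires, so that some
output survives even if dtrace aborts:

# dtrace -x nolibs -n 'sdt:xen:hardclock:jump /arg2 >= 430/ { printf("jump from %d to %d (%d ticks)", arg0, arg1, arg2) }'

That loses the histogram, but it does not depend on being able to hit
^C at the end.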

The system is fine just after a reboot; it certainly seems that a fair
bit of work must be done before it gets into a bad state.

Sometimes, even when the dtrace does continue to run, it is impossible
to exit it with CTRL-C.  The process seems stuck in this state:

[ 4261.7158728] load: 2.64  cmd: dtrace 3295 [xclocv] 0.01u 0.02s 0% 7340k


At that point, the guest must be destroyed and rebooted, as more and
more processes start to get stuck.
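
Next time it wedges, before destroying the guest, I will try to check
whether the later stuck processes are all piling up on the same wait
channel (assuming ps itself still responds at that point), with
something like:

# ps ax -o pid,wchan,stat,command | grep xclocv

If they are all sleeping in xclocv, which I am guessing is a
cross-call (xcall) condvar, that would point at an xcall that never
completes.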

If I don't put too heavy a load on the system (or don't wait too long
after loading it up), I can interrupt the dtrace and it does produce a
histogram.  Here is one after about 2.5 hours of world building:

# dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) } sdt:xen:hardclock:jump /arg2 >= 430/ { printf("hardclock jump violated timecounter contract") }'
dtrace: description 'sdt:xen:hardclock:jump ' matched 2 probes
^C


           value  ------------- Distribution ------------- count    
         8388608 |                                         0        
        16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 220301   
        33554432 |                                         0        
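
If I am reading the quantize() buckets right, and assuming arg1 - arg0
is in nanoseconds (which I believe is the unit Xen system time is kept
in), then all 220301 of the >1-tick jumps landed in the
16777216-33554431 bucket, i.e. roughly 17-34 ms each.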


I have seen the WARNING about negative runtime just once more so far,
but by then the dtrace had already exited with the "systemic
unresponsiveness" message, so there was nothing left to interrupt and
no histogram to collect.
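
For what it is worth, if the fractions in that WARNING really are
64-bit binary fractions of a second as printed, then
0x9e37cf0149d8f7bb/2^64 is about 0.618 and 0xad917b77bd0a7cd3/2^64 is
about 0.678, so the timecounter stepped from about 19420.618 sec back
to about 19419.678 sec, a jump of roughly -0.94 sec.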

-- 
Brad Spencer - b...@anduin.eldar.org - KC8VKS - http://anduin.eldar.org
