You are suggesting "improvements" to a chrony misbehaviour that no longer
exists in the newer versions. Use a newer version and see if you can duplicate
the problem.
Fixing non-existent problems is sure to introduce new problems.
William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology |____ un...@physics.ubc.ca
Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/
On Thu, 9 Jun 2022, Franke, Daniel wrote:
I recently observed some pathological behavior by chrony on a system that was
thrashing under memory pressure. The system was running an older version of
chrony which didn't have
https://git.tuxfamily.org/chrony/chrony.git/commit/?id=59e8b790341f344e07cb4d5124e7dc89de6665a1,
and underwent a failure mode substantially identical to the one in Gruener's
original report which motivated that patch. Chrony was configured with a short
polling interval, the thrashing caused long delays in chrony getting scheduled,
and the backup of timeout events triggered the false-positive infinite loop
detection and chrony crashed. Running a fully-patched version of chrony would
have prevented the crash, but what's interesting is what happened afterward:
the clock drifted by several minutes over the course of less than an hour,
suggesting that at the time of the crash, chrony was slewing the clock at a
rate at or approaching the 83333ppm limit imposed by `maxslewrate`. What I
believe happened is that scheduling delays were causing chrony's clock slews to
get applied for more than double the intended time, so that the clock overshot
and ended up off by more than it began, in the opposite direction. Because
chrony applies large corrections more quickly and aggressively than small ones,
this created a positive-feedback death spiral of increasingly large slews
requiring decreasingly long delays to perpetuate the oscillation.
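
The overshoot arithmetic described above can be illustrated with a toy model. This is only a sketch, not chrony's actual control loop: it assumes each slew is planned to cancel the measured offset exactly over an intended duration, and that scheduling delay stretches the applied duration by a fixed amount. The function name and parameters are hypothetical; the only value taken from the report is the 83333 ppm rate limit.

```python
def simulate_overshoot(initial_offset, intended_duration, applied_duration,
                       rounds, max_rate=0.083333):
    """Toy model of repeated slews that run longer than planned.

    Offsets and durations are in seconds; max_rate is a dimensionless
    rate (83333 ppm = 0.083333). Returns the offset after each round,
    starting with the initial offset.
    """
    offsets = [initial_offset]
    offset = initial_offset
    for _ in range(rounds):
        # Plan a slew that would remove the offset in intended_duration.
        rate = -offset / intended_duration
        # Clamp to the maximum slew rate, as `maxslewrate` would.
        rate = max(-max_rate, min(max_rate, rate))
        # The slew is actually left running for applied_duration.
        offset += rate * applied_duration
        offsets.append(offset)
    return offsets


if __name__ == "__main__":
    # Slew applied for 2.5x the intended time: sign flips, magnitude grows.
    for value in simulate_overshoot(0.001, 1.0, 2.5, 5):
        print(f"{value:+.6f}")
```

While the rate stays below the clamp, each round multiplies the offset by (1 - applied/intended) in this model. The error therefore changes sign whenever the slew runs long, and grows in magnitude once the slew runs for more than twice the intended time.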
This is clearly not good behavior, and there are a couple of ways it could be
improved. On systems where `adjtimex` is available, clock slews can be performed
by manipulating the `offset` field rather than `freq` and `tick` (just like ntpd
does). This would hand off the job of stopping the slew at the appropriate time to
the kernel and completely prevent this kind of overshoot. On systems like OpenBSD
where you only have `adjfreq` or similar (or everywhere, if you think my first
suggestion is too extreme a change), chrony could at least detect the overshoot
after the fact and temporarily back off on the maximum slew rate to prevent the
oscillation from perpetuating. For example, if any of the last several slews is
found to have been applied for t seconds longer than intended, plan the next slew
to last no less than k*t seconds, for some k>1.
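
The back-off rule proposed above might look like the following. This is a minimal sketch under stated assumptions and is not taken from chrony: the class `SlewBackoff`, its methods, the history length of 8, and the default k=2 are all hypothetical choices made for illustration.

```python
from collections import deque


class SlewBackoff:
    """Track recent slew overruns and enforce a minimum planned duration."""

    def __init__(self, k=2.0, history=8):
        if k <= 1.0:
            raise ValueError("k must be greater than 1")
        self.k = k
        # Overruns (seconds) of the last `history` slews.
        self.overruns = deque(maxlen=history)

    def record(self, intended, applied):
        """Record how much longer than intended a slew was applied."""
        self.overruns.append(max(0.0, applied - intended))

    def min_duration(self):
        """Smallest duration the next slew may be planned for: k * t."""
        return self.k * max(self.overruns, default=0.0)

    def plan(self, offset, desired_duration, max_rate=0.083333):
        """Return (rate, duration) for the next slew.

        desired_duration must be positive. The duration is stretched to
        at least k * t, which lowers the rate needed to cancel `offset`.
        """
        duration = max(desired_duration, self.min_duration())
        rate = -offset / duration
        rate = max(-max_rate, min(max_rate, rate))
        return rate, duration


if __name__ == "__main__":
    backoff = SlewBackoff(k=2.0)
    backoff.record(intended=1.0, applied=2.5)  # overran by t = 1.5 s
    print(backoff.plan(offset=0.003, desired_duration=1.0))
```

The reasoning behind k>1: if the next slew is planned for d >= k*t seconds and overruns by t seconds again, it is applied for at most (1 + 1/k) times the intended duration, which is less than double. Under the same simple model, the residual error is then at most 1/k of the original offset, so the oscillation decays rather than grows, provided the new overrun is no worse than the recent ones.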