You are suggesting "improvements" to a chrony misbehaviour that no longer
exists in the newer versions. Use a newer version and see if you can duplicate
the problem. Fixing nonexistent problems is sure to introduce new problems.




William G. Unruh __| Canadian Institute for|____ Tel: +1(604)822-3273
Physics&Astronomy _|___ Advanced Research _|____ Fax: +1(604)822-5324
UBC, Vancouver,BC _|_ Program in Cosmology |____ un...@physics.ubc.ca
Canada V6T 1Z1 ____|____ and Gravity ______|_ www.theory.physics.ubc.ca/

On Thu, 9 Jun 2022, Franke, Daniel wrote:


I recently observed some pathological behavior by chrony on a system that was 
thrashing under memory pressure. The system was running an older version of
chrony that did not include the fix in
https://git.tuxfamily.org/chrony/chrony.git/commit/?id=59e8b790341f344e07cb4d5124e7dc89de6665a1
and underwent a failure mode substantially identical to the one in Gruener's
original report, which motivated that patch. Chrony was configured with a short
polling interval, the thrashing caused long delays in chrony getting scheduled, 
and the backup of timeout events triggered the false-positive infinite loop 
detection and chrony crashed. Running a fully-patched version of chrony would 
have prevented the crash, but what's interesting is what happened afterward: 
the clock drifted by several minutes over the course of less than an hour, 
suggesting that at the time of the crash, chrony was slewing the clock at a 
rate at or approaching the 83333ppm limit imposed by `maxslewrate`. What I 
believe happened is that scheduling delays were causing chrony's clock slews to 
get applied for more than double the intended time, so that the clock overshot 
and ended up off by more than it began, in the opposite direction. Because 
chrony applies large corrections more quickly and aggressively than small ones, 
this created a positive-feedback death spiral of increasingly large slews 
requiring decreasingly long delays to perpetuate the oscillation.
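
The feedback loop described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `plan_slew`, its `base` and `scale` parameters, and the fixed per-slew `delay` are all hypothetical and are not chrony's actual correction algorithm. The only figure taken from the report is the 83333 ppm cap. Each slew is planned to cancel the offset exactly, then runs `delay` seconds too long.

```python
import math

MAX_SLEW = 83333e-6  # 83333 ppm limit quoted above, in seconds per second


def plan_slew(offset, base=1.0, scale=0.01):
    """Hypothetical planner: return (rate, intended_duration) for an offset.

    Larger offsets get shorter intended durations (more aggressive slews),
    with the rate capped at MAX_SLEW. This is NOT chrony's real algorithm.
    """
    magnitude = abs(offset)
    intended = base / (1.0 + magnitude / scale)
    rate = magnitude / intended
    if rate > MAX_SLEW:
        # Rate is capped, so the planned slew has to last longer instead.
        rate = MAX_SLEW
        intended = magnitude / rate
    return rate, intended


def step(offset, delay):
    """Apply one planned slew that overruns by `delay` seconds."""
    rate, intended = plan_slew(offset)
    applied = intended + delay
    return offset - math.copysign(rate * applied, offset)


def simulate(offset, delay, steps):
    """Return the offset before the first slew and after each later one."""
    history = [offset]
    for _ in range(steps):
        offset = step(offset, delay)
        history.append(offset)
    return history


if __name__ == "__main__":
    for label, delay in (("no delay ", 0.0), ("2 s delay", 2.0)):
        trace = simulate(0.001, delay, 8)
        print(label, " ".join(f"{x:+.4f}" for x in trace))
```

In this model the offset left after a slew is the planned rate times the overrun, so the error grows whenever the overrun exceeds the intended duration, that is, whenever the slew is applied for more than double the intended time. Once the rate hits the cap, the oscillation stops growing and settles at an amplitude of `MAX_SLEW * delay`.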

This is clearly not good behavior, and there are a couple of ways it could be
improved. On systems where `adjtimex` is available, clock slews can be performed 
by manipulating the `offset` field rather than `freq` and `tick` (just like ntpd 
does). This would hand off the job of stopping the slew at the appropriate time to 
the kernel and completely prevent this kind of overshoot. On systems like OpenBSD 
where you only have `adjfreq` or similar (or everywhere, if you think my first 
suggestion is too extreme a change), chrony could at least detect the overshoot 
after the fact and temporarily back off on the maximum slew rate to prevent the 
oscillation from perpetuating. For example, if it's detected that any of the last 
several slews got applied for t seconds longer than intended, don't plan to apply 
the next slew for any less than k*t seconds, for some k>1.
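
A minimal sketch of the back-off rule in the last sentence, assuming the overrun of each slew can be measured after the fact. The class name `SlewBackoff`, the `window` size, and the default `k` are hypothetical choices made for illustration and do not correspond to any existing chrony code.

```python
from collections import deque


class SlewBackoff:
    """Enforce a minimum slew duration of k*t after an observed overrun t."""

    def __init__(self, k=2.0, window=4):
        if k <= 1.0:
            raise ValueError("k must be greater than 1")
        self.k = k
        # Overruns (seconds) of the last `window` slews; older ones drop out.
        self.overruns = deque(maxlen=window)

    def record(self, intended, applied):
        """Remember how much longer than intended a slew was applied."""
        self.overruns.append(max(0.0, applied - intended))

    def min_duration(self):
        """Shortest duration allowed for the next slew, in seconds."""
        return self.k * max(self.overruns, default=0.0)

    def plan(self, offset, desired_duration):
        """Return (rate, duration), stretching the slew if back-off applies."""
        duration = max(desired_duration, self.min_duration())
        return offset / duration, duration


if __name__ == "__main__":
    backoff = SlewBackoff(k=2.0, window=3)
    backoff.record(intended=1.0, applied=3.0)   # overran by t = 2 s
    rate, duration = backoff.plan(offset=0.01, desired_duration=0.5)
    print(f"rate={rate * 1e6:.0f} ppm over {duration:.1f} s")
```

If the next slew overruns by no more than the recorded t, the overshoot is at most 1/k of the offset being corrected, so the error shrinks on each round instead of growing. Because the window is bounded, the restriction lapses on its own once scheduling returns to normal.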
