On Thu, Jun 09, 2022 at 03:43:27PM +0000, Franke, Daniel wrote:
> What I believe happened is that scheduling delays were causing
> chrony's clock slews to get applied for more than double the
> intended time, so that the clock overshot and ended up off by more
> than it began, in the opposite direction. Because chrony applies
> large corrections more quickly and aggressively than small ones,
> this created a positive-feedback death spiral of increasingly large
> slews requiring decreasingly long delays to perpetuate the
> oscillation.
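The "more than double" threshold in the quoted paragraph can be checked with a toy model. The sketch below is not chrony code. It assumes each slew is planned to cancel the current offset exactly and is then applied for `overrun_factor` times the intended duration at a constant rate. It ignores the 1 second minimum slew length and chrony's faster handling of large corrections. The names `simulate_slews` and `overrun_factor` are made up for illustration.

```python
def simulate_slews(initial_offset, overrun_factor, rounds):
    """Return the clock offset (seconds) before and after each slew.

    Assumption: every slew is planned to remove the current offset
    exactly, but runs for overrun_factor times the intended duration,
    so it removes overrun_factor times the offset instead.
    """
    offsets = [initial_offset]
    for _ in range(rounds):
        offset = offsets[-1]
        applied = overrun_factor * offset  # correction actually applied
        offsets.append(offset - applied)   # residual = (1 - factor) * offset
    return offsets


if __name__ == "__main__":
    # Compare a factor below 2, exactly 2, and above 2.
    for factor in (1.5, 2.0, 2.5):
        print(factor, simulate_slews(0.1, factor, 4))
```

In this model the residual after one slew is (1 - factor) times the previous offset, so the sign flips whenever the factor exceeds 1, and the magnitude shrinks for a factor below 2, stays constant at exactly 2, and grows above 2.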
That sounds plausible to me. The minimum length of the slew is 1
second. The actual interval would need to be more than double the
intended one for the oscillations to amplify. Can a system where
processes are delayed by more than 1 second still be doing something
useful? If it was a temporary issue, I'd expect chronyd to recover.

> This is clearly not good behavior, and there are a couple ways it could be
> improved. On systems where `adjtimex` is available, clock slews can be
> performed by manipulating the `offset` field rather than `freq` and `tick`
> (just like ntpd does). This would hand off the job of stopping the slew at
> the appropriate time to the kernel and completely prevent this kind of
> overshoot.

The singleshot adjustment (aka adjtime()) on Linux is too slow (500
ppm) to be useful for chronyd. It is used on some other systems like
FreeBSD, where it can go up to 5000 ppm while the ntp_adjtime()
frequency is limited to 500 ppm.

The ntp_adjtime() PLL offset could be used on Linux for slewing, and
some earlier versions of chronyd did that, but it has some issues that
are better avoided, even if it means the slew will overshoot when
chronyd is not able to stop it at the right time.

> On systems like OpenBSD where you only have `adjfreq` or similar (or
> everywhere, if you think my first suggestion is too extreme a
> change), chrony could at least detect the overshoot after the fact
> and temporarily back off on the maximum slew rate to prevent the
> oscillation from perpetuating. For example, if it's detected that
> any of the last several slews got applied for t seconds longer than
> intended, don't plan to apply the next slew for any less than k*t
> seconds, for some k>1.

It would need to avoid false positives, e.g. when the system is
suspended and resumed from disk/RAM. I'd prefer simplicity. There is
already some code detecting unexpected clock jumps larger than 10
seconds, which should reset almost everything, including the currently
running slew.
Maybe the extra slew interval could be limited to those 10 seconds.
I'll look into that.

Thanks,

-- 
Miroslav Lichvar
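For reference, the backoff rule proposed in the quoted message (plan no slew shorter than k*t after an overrun of t seconds) could be sketched roughly as follows. This is an illustration only and not how chronyd is implemented. The function names, the default `k`, and the choice to ignore overruns above 10 seconds as probable suspend/resume events are all assumptions made for the example.

```python
def min_slew_duration(recent_overruns, k=2.0, jump_threshold=10.0):
    """Lower bound (seconds) for the next planned slew duration.

    recent_overruns: seconds by which recent slews ran longer than intended.
    k: safety factor, must be > 1 (hypothetical default).
    jump_threshold: overruns above this are ignored as probable
        suspend/resume events or clock jumps (an assumption that echoes
        the existing 10 second jump detection mentioned in the thread).
    """
    if k <= 1:
        raise ValueError("k must be greater than 1")
    plausible = [t for t in recent_overruns if 0 < t <= jump_threshold]
    worst = max(plausible, default=0.0)
    # The thread states that the minimum slew length is 1 second.
    return max(1.0, k * worst)


def worst_case_ratio(planned, overrun):
    """Actual slew time divided by planned time if the overrun repeats."""
    return (planned + overrun) / planned
```

If the next slew is planned for at least k*t seconds and is again delayed by at most t seconds, the actual-to-planned ratio is at most 1 + 1/k, which stays below 2 for any k > 1, so the overshoot would not amplify under these assumptions. For example, `min_slew_duration([0.4, 3.0, 45.0])` ignores the 45 second sample and returns a bound based on the 3 second overrun.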