[ntp:questions] Re: tinker step 0 (always slew) and kernel time discipline

user Mon, 25 Sep 2006 11:30:29 -0700

Joe,

First of all, you misunderstand what the prefer keyword is for and what it is intended to do. It is not applicable to your scenario. As for jerking back and forth every 15 minutes, something is seriously broken with the hardware, either a stuck bit or kernel problem. Consider the step actions as a temporal canary. Considering the rather large number of servers around here and the national labs, if a step ever occurs, the hardware is to blame.

Second, you apparently are using two servers that diverge widely about their times. The clients will be most confused as to which of the servers is trustable. This is not a step problem, it is a fatal condition for the applications. If the divergence is due to configuring both servers with the local clock driver, this violates the principle that all servers cling to the same timescale, UTC or synthetic. If you really need to have redundant servers that cling to the same synthetic timescale, configure both servers in orphan mode and symmetric active mode with each other. Do not use the local clock driver.
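
(Editorial sketch, not part of the original message.) The orphan-plus-symmetric-active arrangement described above might look roughly like the following ntp.conf fragment. The host names and the orphan stratum value are assumptions chosen for illustration, and a real deployment would normally also list upstream servers and authentication keys.

```
# ntp.conf on ntp-a.example.com (hypothetical host name)
tos orphan 6              # orphan mode; the stratum value 6 is illustrative
peer ntp-b.example.com    # symmetric active association with the other server

# ntp.conf on ntp-b.example.com mirrors this:
#   tos orphan 6
#   peer ntp-a.example.com
```

Note that neither file contains a local clock driver (127.127.1.0) line.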

A better choice is to have three servers configured as above. If one of them sails to the sunset, a majority clique is still possible. If only two servers and one of them sails away, the clients cannot form a majority clique and will conclude neither of them is sane.
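
(Editorial sketch, not part of the original message.) The two-versus-three argument can be illustrated with a deliberately simplified majority check. This is not ntpd's actual selection algorithm, which intersects correctness intervals derived from each server's offset and root distance; the function name `majority_clique` and the single fixed `tolerance` are assumptions made for the example.

```python
from typing import List, Optional


def majority_clique(offsets: List[float], tolerance: float) -> Optional[List[int]]:
    """Return the indices of the largest group of servers whose offsets all
    lie within `tolerance` seconds of each other, or None when that group is
    not a strict majority of the servers."""
    # Sort server indices by offset so agreeing servers are adjacent.
    order = sorted(range(len(offsets)), key=lambda i: offsets[i])
    best: List[int] = []
    start = 0
    for end in range(len(order)):
        # Shrink the window until its spread fits inside the tolerance.
        while offsets[order[end]] - offsets[order[start]] > tolerance:
            start += 1
        if end - start + 1 > len(best):
            best = order[start:end + 1]
    # A clique only counts if it outnumbers everyone outside it.
    if len(best) * 2 > len(offsets):
        return sorted(best)
    return None


# Three servers, one of which has sailed away: the other two still agree.
print(majority_clique([0.001, 0.002, 400.0], tolerance=0.128))
# Two servers that disagree: no majority, so neither can be trusted.
print(majority_clique([0.0, 400.0], tolerance=0.128))
```

With three servers the two that agree outvote the stray; with two servers that disagree, each is a group of one and no strict majority exists.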

Above all, if you are serious about the integrity of the time function and believe in Lamport's happens-before relation, as interpreted by NTP, take very seriously the topics discussed in the white papers linked from the NTP project page. Also, there should be no excuse for not detecting and responding to a scenario where servers can show serious disagreement without being reported to your beeper. That's how the NIST servers are monitored.
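
(Editorial sketch, not part of the original message.) A monitoring check for server disagreement could be as small as the following. It assumes the per-server offsets have already been collected by some other means; the function name `disagreeing_pairs`, the server names, and the one-second limit are hypothetical, and this says nothing about how the NIST servers are actually monitored.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def disagreeing_pairs(offsets: Dict[str, float],
                      limit: float) -> List[Tuple[str, str, float]]:
    """Return (server_a, server_b, gap) for every pair of servers whose
    measured offsets differ by more than `limit` seconds."""
    alerts = []
    # Compare every unordered pair once, in a stable (sorted) order.
    for (a, off_a), (b, off_b) in combinations(sorted(offsets.items()), 2):
        gap = abs(off_a - off_b)
        if gap > limit:
            alerts.append((a, b, gap))
    return alerts


# Hypothetical offsets in seconds, as seen from one monitoring host.
for a, b, gap in disagreeing_pairs({"ntp-a": 0.002, "ntp-b": 400.0}, limit=1.0):
    print(f"ALERT: {a} and {b} disagree by {gap:.3f} s")
```

Whatever the check returns would then be handed to the paging system of choice.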


Dave

Joe Harvell wrote:

David L. Mills wrote:
<snip>
5. If for some reason the server(s) are not reachable at startup and the applications must start, then I would assume the applications would fail, since the time is not synchronized. If the applications use the NTP system primitives, the synchronization condition is readily apparent in the return code. Since they can't run anyway, there is no harm in stepping the clock, no matter what the initial offset. Forcing a slew in this case would seem highly undesirable, unless the application can tolerate large differences between clocks and, in that case, using ntpd is probably a poor choice in the first place.
I agree that the condition of no time servers reachable on startup is the most common case where a large offset will eventually be observed. I agree that the application should detect this and fail before starting up. I am concerned about clock and network failure scenarios that cause an NTP client to see two different NTP servers with very different times.
This actually happened in a testbed for our application. NTP stats show that over the course of 22 days, the offsets of two configured NTP servers (both ours) serving one of our NTP clients started diverging up to a maximum distance of 800 seconds. During this time, our NTP client stepped its clock forward 940 times and backwards 803 times, with increasing magnitudes up to ~400 seconds. The problem went away when someone "added an IP address to the configuration of one of the NTP servers." (I am still trying to determine exactly what happened). The ntp.conf files of the NTP client, the stats, and a nice graph of the offsets can be found at http://dingo.dogpad.net/ntpProblem/.
I concede that only having 2 NTP servers for our host made this problem more likely to occur. But considering the mayhem caused by jerking the clock back and forth every 15 minutes for 22 days, I think it is worth investigating whether to eliminate stepping altogether.
I still don't understand why the clock was being stepped back and forth. One of the NTP servers showed up with 80f4 (unreachable) status every 15 minutes for the entire 22 days, but with 90f4 (reject) and 96f4 (sys.peer) in between. Oddly, this server was one of two servers, but the *other* server was the preferred peer. I wonder why this peer would ever be selected as the sys.peer since the prefer peer was only reported unreachable 10 times over this 22 day period. Would this be because the selection algorithm finds no intersection?
Maybe the behavior I saw was a bug, and not the expected consequence of a failure scenario in which 2 NTP servers have diverging clocks.
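
(Editorial sketch, not part of the original thread.) The three status words quoted above, 80f4, 90f4 and 96f4, can be unpacked with a few bit operations. The field layout used here (five status bits, a three-bit select code, a four-bit event count and a four-bit event code) follows the peer status word as described in the ntpq documentation, but the select-code names have changed between releases, so the table below is an assumption that should be checked against the ntpd version in use. The function name `decode_peer_status` is hypothetical.

```python
from typing import Dict

# Select-code names as commonly documented; older releases use slightly
# different labels (for example "outlyer"), so treat these as assumptions.
SELECT_NAMES = {
    0: "reject", 1: "falsetick", 2: "excess", 3: "outlier",
    4: "candidate", 5: "backup", 6: "sys.peer", 7: "pps.peer",
}


def decode_peer_status(word: int) -> Dict[str, object]:
    """Split a 16-bit peer status word into its documented fields."""
    high = (word >> 8) & 0xFF          # status bits plus select code
    return {
        "configured": bool(high & 0x80),   # persistent association
        "reachable": bool(high & 0x10),    # host reachable
        "select": SELECT_NAMES[high & 0x07],
        "event_count": (word >> 4) & 0x0F,
        "event_code": word & 0x0F,
    }


for word in (0x80F4, 0x90F4, 0x96F4):
    print(f"{word:04x}", decode_peer_status(word))
```

Under this layout 80f4 is a configured but unreachable peer, 90f4 is reachable but rejected, and 96f4 is reachable and selected as sys.peer, which matches the labels in the quoted text.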


