In article <[EMAIL PROTECTED]>, Joe Harvell <[EMAIL PROTECTED]> wrote:
> This actually happened in a testbed for our application. NTP stats
> show that over the course of 22 days, the offsets of two configured
> NTP servers (both ours) serving one of our NTP clients started
> diverging up to a maximum distance of 800 seconds.

This could only happen if either the implementation was broken, or
they were mis-using the local clock pseudo reference clock. If the
servers were using a proper reference clock as their primary source,
root dispersion would have exceeded its maximum value while the error
was probably still a lot less than a second, and the servers would
have been rejected completely.

Configuring a local clock breaks this process, so it should never be
done by default (even though distributors like doing this). In many
cases it is best not to have a local reference clock configured at
all. If you do have more than one configured, you should arrange to
give each server a different stratum, with steps of two between them,
so that there is a well-defined priority amongst the different
machines. If you don't have any real reference clocks in the overall
network, it is even more important that there is normally only one
possible choice of local clock reference. Having two local clock
references that are diverging violates the fundamental principle that
all NTP times are traceable to a single (and preferably UTC) time
source.

> During this time, our NTP client stepped its clock forward 940 times
> and backwards 803 times, with increasing magnitudes up to ~400
> seconds. The problem went away when someone "added an IP address to
> the configuration of one of the NTP servers." (I am still trying to
> determine exactly what happened.)

That sounds like that server had a local reference clock as a
fallback.

> The ntp.conf files of the NTP client, the stats, and a nice graph of
> the offsets are found at http://dingo.dogpad.net/ntpProblem/.
>
> I concede that only having 2 NTP servers for our host made this
> problem more likely to occur.
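As a sketch of the stratum-stepping arrangement, in ntp.conf the local
clock is the standard pseudo driver address 127.127.1.0; the stratum
values below are just illustrative choices, not recommendations:

```
# First fallback server: local clock as a last-resort reference.
server 127.127.1.0
fudge  127.127.1.0 stratum 10

# Second fallback server: two strata further down, so it only wins
# if the first fallback is unreachable.
server 127.127.1.0
fudge  127.127.1.0 stratum 12
```

With distinct strata, clients that can reach both fallbacks still see
a single well-defined priority order rather than two competing
free-running clocks.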
> But considering the mayhem caused by jerking the clock back and
> forth every 15 minutes for 22 days, I think it is worth
> investigating whether to eliminate stepping altogether.

15 minutes sounds like the verification period before ntpd becomes
convinced that its time really is seriously wrong.

> I still don't understand why the clock was being stepped back and
> forth. One of the NTP servers showed up with 80f4 (unreachable)
> status every 15 minutes for the entire 22 days, but with 90f4
> (reject) and 96f4 (sys.peer) in between. Oddly, this server was one
> of two servers, but the *other* server was the preferred peer. I
> wonder why this peer would ever be selected as the sys.peer, since
> the prefer peer was only reported unreachable 10 times over this
> 22-day period. Would this be because the selection algorithm finds
> no intersection?

Normally, I believe, if you have just two servers and they have
non-intersecting error bounds, they will both be rejected and the
system will free-run. However, I think that prefer confuses the issue
by not allowing the preferred one to be discarded. I have a feeling
this is actually done by saying that the system stops discarding when
it would otherwise discard that one. I suppose that the other one
could still be in contention at that point.

> Maybe the behavior I saw was a bug, and not the expected consequence
> of a failure scenario in which 2 NTP servers have diverging clocks.

The expected behaviour in that scenario is that one server is giving
a false time and the other is giving UTC time. The remaining servers
will also give UTC time, so the bad one will get voted out. I don't
think prefer is intended to deal with broken clocks, only with more
accurate ones.

_______________________________________________
questions mailing list
[email protected]
https://lists.ntp.isc.org/mailman/listinfo/questions
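To illustrate the "no intersection" point above, here is a toy sketch
(not ntpd's actual code) of the idea behind the selection algorithm:
each server contributes a correctness interval around its offset, and
a server can only survive if a majority of intervals share a point.

```python
def majority_intersection(intervals):
    """Return a point contained in a majority of the correctness
    intervals [lo, hi], or None if no such majority point exists.

    If a majority intersection exists, some interval endpoint lies in
    it, so checking only endpoints is sufficient for this sketch.
    """
    endpoints = sorted(p for lo, hi in intervals for p in (lo, hi))
    need = len(intervals) // 2 + 1  # strict majority
    for p in endpoints:
        inside = sum(lo <= p <= hi for lo, hi in intervals)
        if inside >= need:
            return p
    return None

# Two servers whose clocks have diverged by 800 s: their error bounds
# cannot overlap, so there is no agreement and the client free-runs.
a = (-0.05, 0.05)      # server 1: offset 0 s, error bound 50 ms
b = (799.95, 800.05)   # server 2: offset 800 s, error bound 50 ms
print(majority_intersection([a, b]))              # -> None
print(majority_intersection([a, (-0.02, 0.08)]))  # -> -0.02
```

With only two servers, "majority" means both, so any divergence larger
than the combined error bounds leaves nothing to synchronise to.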
