Re: [ntp:questions] NTP vs chrony comparison (Was: oscillations in ntp clock synchronization)
[EMAIL PROTECTED] (Danny Mayer) writes:

David L. Mills wrote: Danny, It doesn't stop working; it just clamps whatever it gets to +-500 PPM as appropriate. If the intrinsic error is greater than 500 PPM, the loop will do what it can, with the residual it can't correct showing as a systematic time offset. Dave

I didn't mean to suggest that ntpd stopped running. It was that the clock was drifting steadily off into the sunset. I realize that if the problem corrected itself ntpd would bring things back to normal.

But that suggests that the drift rate of your chip became bigger than 500 PPM, which is huge. Maybe something altered the tick size inappropriately. ntp should have hauled the offset back to zero -- just taking a longer time (100 ms at 500 PPM takes about 200 s to eliminate -- which is not that long). Danny

Danny Mayer wrote: David L. Mills wrote: Danny, Unless the computer clock intrinsic frequency error is huge, the only time the 500-PPM clamp kicks in is with a 100-ms step transient and poll interval 16 s. The loop still works if it hits the stops; it just can't drive the offset to zero. Dave

Yes, I found this out when my laptop stopped disciplining the clock and was complaining about the frequency limits, and I started digging into the code to figure out why. Danny
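As a sanity check on the arithmetic quoted above (100 ms at 500 PPM taking about 200 s to slew out), here is a minimal sketch; the figures are the ones from the thread and nothing here is ntpd code:

    /* At the maximum slew rate of 500 PPM, correcting a residual offset
     * takes offset / 5e-4 seconds.  Back-of-envelope check only. */
    #include <stdio.h>

    int main(void)
    {
        double offset = 0.100;      /* 100 ms residual offset, in seconds */
        double max_slew = 500e-6;   /* the +-500 PPM clamp */
        printf("time to slew out %.0f ms: %.0f s\n",
               offset * 1e3, offset / max_slew);   /* prints 200 s */
        return 0;
    }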
Re: [ntp:questions] NTP vs chrony comparison (Was: oscillations in ntp clock synchronization)
[EMAIL PROTECTED] (David Malone) writes:

Unruh [EMAIL PROTECTED] writes: weekends. Lots of power at 10^-5 Hz and harmonics, and .7 10^-8 Hz -- more than would be predicted by 1/f

10^-5 Hz is about once per day. I'm not sure what .7 10^-8 Hz is - it seems to be about once every 4.5 years? I would have assumed you'd get power around 10^-5 Hz (daily), 10^-6 Hz (weekly) and maybe 3x10^-8 Hz (yearly) based on a mix of environmental factors (air conditioning/heating) and usage?

Yes, that was supposed to be 1/week. David.
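For anyone checking these figures, the frequency-to-period conversions work out as below; a throwaway sketch, with 1.65x10^-6 Hz being simply 1/week:

    /* Convert the spectral frequencies in this thread to periods in days. */
    #include <stdio.h>

    int main(void)
    {
        const double freqs[]  = { 1e-5, 1.65e-6, 3.2e-8, 0.7e-8 };
        const char *labels[] = { "~daily", "~weekly", "~yearly", "~4.5 years" };
        for (int i = 0; i < 4; i++)
            printf("%.2e Hz -> period %.1f days (%s)\n",
                   freqs[i], 1.0 / freqs[i] / 86400.0, labels[i]);
        return 0;
    }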
Re: [ntp:questions] strange behaviour of ntp peerstats entries.
David L. Mills [EMAIL PROTECTED] writes:

Danny, True; there is an old RFC or IEN that reports the results with varying numbers of clock filter stages, from which the number eight was the best. Keep in mind these experiments were long ago and with, as I remember, ARPAnet sources. The choice might be different today, but probably would not result in great improvement in the general cases. Note however that the popcorn spike suppressor is a very real Internet add-on.

Oh yes, popcorn suppression is important. I agree. But the filter goes well beyond that. My reaction is that on the one hand people keep saying how important net load is, and that one does not want to use poll intervals that are much smaller than 8 or 10, and on the other hand, ntpd throws away 80-90% of the data collected. Reminds me of the story of Saul, king of the Israelites, whose army was besieged, and he mentioned that he was thirsty. A few of his soldiers risked everything to get through the enemy lines and bring him water. He was so impressed that he poured it all out on the ground, in tribute to their courage. I have always found that story an incredible insult to the bravery instead. The procedure does drastically reduce the variance of the delay, but does not do much for the variance of the offset, which is of course what is important. Just to bring up chrony again, it uses both a suppression rule, where round trips greater than say 1.5 times the minimum are discarded, and a weighting, where data is weighted by some power of the inverse of the delay.

The number of stages may have unforeseen consequences. The filter can (and often does) introduce additional delay in the feedback loop. The loop time constant takes this into account so the impulse response is only marginally affected. So, the loop is really engineered for good response with one accepted sample in eight. Audio buffs will recognize that any additional samples only improve the response, since they amount to oversampling the signal. Audio buffs will also recognize the need for zeal in avoiding undersampling, which is why the poll-adjust algorithm is so squirrely. Dave

Danny Mayer wrote: Unruh wrote: [EMAIL PROTECTED] (Danny Mayer) writes: Unruh wrote: Brian Utterback [EMAIL PROTECTED] writes: Unruh wrote: David L. Mills [EMAIL PROTECTED] writes: You might not have noticed a couple of crucial issues in the clock filter code.

I did notice them all. Thus my caveat. However, throwing away 80% of the precious data you have seems excessive. Note that the situation can arise that one can wait many more than 8 samples for another one. Say sample i is a good one and remains the best for the next 7 tries. Sample i+7 is slightly worse than sample i and thus it is not picked as it comes in. But the next 8 samples are all worse than it. Thus it remains the filtered one, but is never used because it was not the best when it came in. This situation could keep going for a long time, meaning that ntp suddenly has no data to do anything with for many many poll intervals. Surely using sample i+7 is far better than not using any data for that length of time.

On the contrary, it's better not to use the data at all if it's suspect. ntpd is designed to continue to work well even in the event of losing all access to external sources for extended periods. And this could happen again.
Now, since the delays are presumably random variables, the chances of this happening are not great (although under a condition of gradually worsening network the chances are not that small), but since one is running ntp for millions or billions of samples, the chances of this happening sometime become large.

There are quite a few ntpd servers which are isolated and once an hour use ACTS to fetch good time samples. This is not rare at all.

And then promptly throw them away because they do not satisfy the minimum condition? No, it is not best to throw away data no matter how suspect. Data is a precious commodity and should be thrown away only if you are damn sure it cannot help you. For example, let's say that the change in delay is .1 of the variance of the clock. The max extra noise that delay can cause is about .01. Yet NTP will chuck it. Now if the delay is 100 times the variance, sure, chuck it. It probably cannot help you. The delay is a random process, non-gaussian admittedly, and its effect on the time is also a random process -- usually much closer to gaussian.

And why was the figure of 8 chosen (the best of the last 8 tries)? Why not 1? Or 3? I suspect it came off the top of someone's head -- let's not throw away too much stuff, since it would make ntp unusable, but let's throw away some to feel virtuous. Sorry for being sarcastic, but I would really like to know what the justification was for throwing so much data away.

No, 8 was chosen after a lot of experimentation to ensure the best results over a wide range of configurations. Dave has adjusted these numbers over the years and he's the person to ask.
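For readers following along, the selection rule being debated (use the lowest-delay sample of the last eight) can be sketched as below. This is an illustration of the principle only, not ntpd's actual clock filter, which also tracks dispersion, ages samples, and refuses samples older than the last one used; the names and the 16 s dummy delay are made up:

    /* Keep the last 8 (offset, delay) samples and return the one with
     * minimum delay, on the theory that low round-trip delay implies
     * low offset error.  Sketch only. */
    #include <stddef.h>

    #define NSTAGE 8

    struct sample {
        double offset;   /* clock offset from this exchange, seconds */
        double delay;    /* round-trip delay of this exchange, seconds */
    };

    /* Shift register of the last NSTAGE samples, newest at index 0.
     * Unfilled stages carry a dummy delay so they never win. */
    static struct sample filt[NSTAGE] = {
        { 0, 16 }, { 0, 16 }, { 0, 16 }, { 0, 16 },
        { 0, 16 }, { 0, 16 }, { 0, 16 }, { 0, 16 },
    };

    struct sample clock_filter(struct sample new_sample)
    {
        for (size_t i = NSTAGE - 1; i > 0; i--)   /* shift in the new sample */
            filt[i] = filt[i - 1];
        filt[0] = new_sample;

        size_t best = 0;                          /* pick the minimum delay */
        for (size_t i = 1; i < NSTAGE; i++)
            if (filt[i].delay < filt[best].delay)
                best = i;
        return filt[best];
    }

The scenario described above falls out of the extra "newer than last used" rule, which the sketch omits: a sample can be the register's minimum yet never be delivered, because it was not the minimum on the poll when it arrived.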
Re: [ntp:questions] NTP vs chrony comparison (Was: oscillations in ntp clock synchronization)
Unruh wrote: [EMAIL PROTECTED] (Danny Mayer) writes: David L. Mills wrote: Danny, It doesn't stop working; it just clamps whatever it gets to +-500 PPM as appropriate. If the intrinsic error is greater than 500 PPM, the loop will do what it can, with the residual it can't correct showing as a systematic time offset. Dave

I didn't mean to suggest that ntpd stopped running. It was that the clock was drifting steadily off into the sunset. I realize that if the problem corrected itself ntpd would bring things back to normal.

But that suggests that the drift rate of your chip became bigger than 500 PPM, which is huge. Maybe something altered the tick size inappropriately. ntp should have hauled the offset back to zero -- just taking a longer time (100 ms at 500 PPM takes about 200 s to eliminate -- which is not that long).

No, it was something else entirely and not something that ntpd, chrony or any other application could do anything about. It's fixed now. Danny
Re: [ntp:questions] NTP vs chrony comparison (Was: oscillations in ntp clock synchronization)
What was the problem?

On Mon, 28 Jan 2008, Danny Mayer wrote: Unruh wrote: [EMAIL PROTECTED] (Danny Mayer) writes: David L. Mills wrote: Danny, It doesn't stop working; it just clamps whatever it gets to +-500 PPM as appropriate. If the intrinsic error is greater than 500 PPM, the loop will do what it can, with the residual it can't correct showing as a systematic time offset. Dave

I didn't mean to suggest that ntpd stopped running. It was that the clock was drifting steadily off into the sunset. I realize that if the problem corrected itself ntpd would bring things back to normal.

But that suggests that the drift rate of your chip became bigger than 500 PPM, which is huge. Maybe something altered the tick size inappropriately. ntp should have hauled the offset back to zero -- just taking a longer time (100 ms at 500 PPM takes about 200 s to eliminate -- which is not that long).

No, it was something else entirely and not something that ntpd, chrony or any other application could do anything about. It's fixed now. Danny

--
William G. Unruh   | Canadian Institute for | Tel: +1(604)822-3273
Physics & Astronomy | Advanced Research      | Fax: +1(604)822-5324
UBC, Vancouver, BC  | Program in Cosmology   | [EMAIL PROTECTED]
Canada V6T 1Z1      | and Gravity            | www.theory.physics.ubc.ca/
Re: [ntp:questions] quirky adjtimex behaviour [SOLVED]
Hello Dean and Hal,

On Tuesday, January 22, 2008 at 1:08:00 +0000, Dean S. Messing wrote:

hal-usenet wrote: try changing the code that reads the CMOS clock to spin in a loop reading it until it changes. That will give you the time early in the second.

The adjtimex code is already designed to detect the exact beginning of an RTC second, either via the /dev/rtc update-ended interrupt, or by busy-waiting for the fall of the update-in-progress (UIP) flag. But nevertheless your analysis of the facts seems good, Hal: this tick synchronisation probably fails for some unknown reason in Dean's case.

I just replaced version 1.23 of adjtimex with an old version 1.20 and the quirky behaviour disappeared. I first noticed it on my new Fedora 7 with version 1.21.

Interesting: adjtimex 1.21 was the first version using by default the /dev/rtc interrupt to detect the clock beat. The problem might be there. Adjtimex 1.23 has an option to force the UIP method: does it show the quirky offsets?

| # adjtimex --utc --compare=20 --interval=10 --directisa

Anyway the default /dev/rtc method is preferable. The 1.23 debug output may reveal what's up with your interrupts:

| # adjtimex --utc --compare=1 --verbose

Serge. -- Serge point Bets arobase laposte point net
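For reference, the /dev/rtc update-interrupt technique being discussed looks roughly like the following under Linux. This is a simplified sketch of the general method, not adjtimex source; error handling is abbreviated:

    /* Enable the RTC update interrupt and block in read() until the RTC
     * second rolls over, then grab the system time at that instant. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/time.h>
    #include <linux/rtc.h>

    int main(void)
    {
        unsigned long data;
        struct timeval tv;

        int fd = open("/dev/rtc", O_RDONLY);
        if (fd < 0) { perror("open /dev/rtc"); return 1; }

        if (ioctl(fd, RTC_UIE_ON, 0) < 0) { perror("RTC_UIE_ON"); return 1; }
        if (read(fd, &data, sizeof data) < 0)   /* blocks until the tick */
            perror("read");
        gettimeofday(&tv, NULL);                /* system time at the tick */
        printf("RTC second began at %ld.%06ld\n",
               (long)tv.tv_sec, (long)tv.tv_usec);

        ioctl(fd, RTC_UIE_OFF, 0);
        close(fd);
        return 0;
    }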
Re: [ntp:questions] NTP vs chrony comparison (Was: oscillations in ntp clock synchronization)
David, We can argue about the Hurst parameter, which can't be truly random-walk as I have assumed, but the approximation is valid up to lag times of at least a week. However, as I have been cautioned, these plots are really sensitive to spectral lines due to nonuniform sampling. I was very careful to avoid such things. Dave

David Malone wrote: Unruh [EMAIL PROTECTED] writes: weekends. Lots of power at 10^-5 Hz and harmonics, and .7 10^-8 Hz -- more than would be predicted by 1/f

10^-5 Hz is about once per day. I'm not sure what .7 10^-8 Hz is - it seems to be about once every 4.5 years? I would have assumed you'd get power around 10^-5 Hz (daily), 10^-6 Hz (weekly) and maybe 3x10^-8 Hz (yearly) based on a mix of environmental factors (air conditioning/heating) and usage? David.
Re: [ntp:questions] very slow convergence of ntp to correct time.
On Sun, 20 Jan 2008 17:50:41 GMT, Unruh [EMAIL PROTECTED] wrote for the entire planet to see:

[EMAIL PROTECTED] (David Woolley) writes: In article [EMAIL PROTECTED], Unruh [EMAIL PROTECTED] wrote: snip I would assume that ntp is giving these samples with long round trip very low weight, or even eliminating them. Note: if these spikes are positive, they may be the result of lost ticks.

Don't think so. I think they are 5-10 ms transmission delays. The delays disappear if I run at maxpoll 7 rather than 10, so I suspect the router is forgetting the addresses and taking its own sweet time about finding them if the time between transmissions is many minutes. chrony has a nice feature of being able to send an echo datagram to the other machine if you want (before the ntp packet), to wake up the routers along the way.

There are several related effects here that I have experienced in my NTP network.

First is the possible ARP resolution overhead. If the IP addresses of your host and of the destination or default gateway are not passing traffic frequently, the ARP cache in your host or the local router can time out and need to be reloaded on each poll. These reloads can be on the order of 5-10 ms and will affect only one side of the transaction's transmission delay. Unfortunately ARP often uses a 15 minute TTL, and default NTP uses a 17 minute poll interval.

Then there is the whole problem that many routers along the path experience extra overhead on the first packet of a flow. Route table lookups are done by destination IP of course, but the result generally has to be installed into the cache, or FIB, the first time a new source/dest IP pair shows up. This is often a 1-3 ms overhead. And that entry doesn't last forever either.

Then there is the MAC cache in your switches, which generally purges after 1-5 minutes. This can often be adjusted higher, but that can sometimes cause issues for others when they are reconfiguring part of the network.

Another issue is NATting or stateful firewalls. There is often outbound (or inbound) connection setup time. Without special configuration this often times out before twenty minutes, leading to more asymmetric delay.

I think the suggestion of a pre-poll ICMP echo is kinda interesting. It might be possible to limit the packet TTL to five hops or so, just warming up your side of the network. It might also be better to make it a mostly standard UDP NTP packet so it matches whatever rules the intermediate devices are applying (and you want them to remember). QoS and policy routing are both sensitive to port numbers, and certainly most firewalls are protocol sensitive, so matching the initial packet attributes to the desired high-performance packet attributes would probably help this technique work.

To mitigate some of these effects it might not have to be done that often. In many hierarchical network topologies it might serve just to send one extra packet every 3-5 minutes using the same source IP/port that NTP normally uses, to any configured server. And it could still have a limited TTL if desired. That would at least keep the switch and ARP caches fresh and, depending on the design, the policy and NAT caches as well. - Eric
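Eric's warm-up idea could be prototyped along these lines. Everything here is hypothetical: the 5-hop TTL, the empty NTP-sized payload, and the warmup() helper are assumptions drawn from the paragraph above, and nothing like this exists in ntpd:

    /* From the same socket the real poll would use, send one small UDP
     * datagram with a short TTL shortly before the poll, so ARP, route,
     * and NAT entries along the near part of the path are fresh. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int warmup(int sock, const struct sockaddr_in *server)
    {
        int ttl = 5;                   /* only warm the near hops */
        char payload[48] = { 0 };      /* NTP-sized datagram, content unused */

        if (setsockopt(sock, IPPROTO_IP, IP_TTL, &ttl, sizeof ttl) < 0)
            return -1;
        ssize_t n = sendto(sock, payload, sizeof payload, 0,
                           (const struct sockaddr *)server, sizeof *server);
        ttl = 64;                      /* restore a normal TTL for the poll */
        setsockopt(sock, IPPROTO_IP, IP_TTL, &ttl, sizeof ttl);
        return n < 0 ? -1 : 0;
    }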
Re: [ntp:questions] strange behaviour of ntp peerstats entries.
Unruh wrote: I am also a little bit surprised that it is the delay that is used and not the total roundtrip time. As I seem to read it, the delay is (t4-t3+t2-t1), i.e., it does not take into account the delay within the far machine (e.g. t4-t1), but only propagation delay. I would expect that the former might even be more important than the latter, but that is a pure guess -- i.e. no measurements on even one system to back it up. Now it may be that on that rocky road to Manila the propagation delay is by far the most important, but on a modern LAN, especially with a low propagation delay of hundreds of usec rather than 100s of msec, I wonder.

The calculation of the offset is not affected at all by the time between taking the two timestamps on the remote machine. In fact, on symmetric peers, this time could be in the thousands of seconds or more. Brian Utterback
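For concreteness, the standard NTP on-wire calculation being discussed, with example timestamps chosen to show why the time spent inside the far machine does not matter: the server's processing time (t3 - t2) cancels out of the offset and is subtracted from the delay. The timestamp values are made up:

    /* t1 client transmit, t2 server receive, t3 server transmit,
     * t4 client receive; all as seconds on some common scale. */
    #include <stdio.h>

    int main(void)
    {
        double t1 = 100.000000;
        double t2 = 100.000070;   /* 70 us outbound path */
        double t3 = 100.004070;   /* 4 ms spent inside the server */
        double t4 = 100.004150;   /* 80 us return path */

        double offset = ((t2 - t1) + (t3 - t4)) / 2;
        double delay  = (t4 - t1) - (t3 - t2);   /* == t4 - t3 + t2 - t1 */
        printf("offset %.6f s, delay %.6f s\n", offset, delay);
        return 0;   /* offset -5 us, delay 150 us: the 4 ms is gone */
    }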
Re: [ntp:questions] very slow convergence of ntp to correct time.
Rick Jones wrote: Eric [EMAIL PROTECTED] wrote: Then there is the MAC cache in your switches, which generally purges after 1-5 minutes. This can often be adjusted higher, but that can sometimes cause issues for others when they are reconfiguring part of the network.

I suppose if STP gets involved, but just on its own, the forwarding table in a switch being aged should only mean that the next frame to that MAC will go out all (enabled) ports on the switch until that MAC is seen again as a source. That shouldn't affect timing really. rick jones

There is some blurring of device type going on. The problem is not which port to send to, but rather which MAC address to send to. This is more of a problem with routers than switches, but with VLANs and whatever these days, the device in question might be both. In any case, if the needed MAC is not available, there has to be an ARP request and response before the packet can be sent, but this delay is not evident in the return trip for the NTP response packet, introducing an asymmetric delay, the worst thing that can happen to NTP.

I reported this problem many years ago and suggested using burst at that time, but thought that it would be overkill and asked for a way to tune it to a fewer number of packets in the burst. Dave was reticent and I was newer to the project then and didn't want to push it. Perhaps it is time. Brian Utterback
[ntp:questions] NTP Statistics
Hi, I am using ntp v4.2.0 and have a question about the system statistics interpretation on my NTP server. The ntp page on monitoring options lists 11 'system stats' fields as follows: http://www.eecis.udel.edu/~mills/ntp/html/monopt.html

MJD date
time past midnight
time since restart
packets received last hour
server packets received last hour
current version packets last hour
previous version packets last hour
access denied packets last hour
bad length or format packets last hour
bad authentication packets last hour
rate exceeded packets last hour

A sample of my daily filegen output is as follows (12 fields):

54493 514.622 117 12 3 12 0 0 0 0 0 0
54493 4118.319 118 14 2 14 0 0 0 0 0 0
54493 7722.012 119 13 3 13 0 0 0 0 0 0

It is critical for me to understand the meaning of each of the stats to properly monitor my NTP deployment. Can someone please point me to more detailed descriptions, or maybe just confirm and comment on my guesses below. My interpretation of the stats (for my first line of output listed above) is:

54493 - MJD date
514.622 - UTC time past midnight in seconds
117 - time since restart (in hours? or is this just a record count?)
12 - packets received last hour (NTP request packets from clients last hour)
3 - server packets received last hour (packets from other servers??)
12 - current version packets last hour (NTP request packets from clients using the same version of NTP)
0 - previous version packets last hour
0 - access denied packets last hour (not allowed to synchronize with me???)
0 - bad length or format packets last hour
0 - bad authentication packets last hour (bad MD5 check??)
0 - rate exceeded packets last hour (exceeded the minimum poll rate or some such?)
0 - extra field not described on the web page

Most critical for me is that I understand that packets received last hour is really a count of NTP requests from clients. I want to use this to get a rough idea of the load on my server, for scaling and monitoring. thanks, Steve
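A quick way to sanity-check a layout like this is to parse a line and compare the counters against known traffic. The sketch below assumes the 12-field order guessed at above; the meaning of the trailing field varies by version, so it is read but treated as opaque. record_sys_stats() in ntp_util.c is the authoritative reference for any given release:

    /* Parse one sysstats line under the assumed 12-field layout. */
    #include <stdio.h>

    int main(void)
    {
        const char *line = "54493 514.622 117 12 3 12 0 0 0 0 0 0";
        long mjd, uptime, recv, srvpkt, newver, oldver,
             denied, badlen, badauth, rate, extra;
        double sec;

        if (sscanf(line, "%ld %lf %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld",
                   &mjd, &sec, &uptime, &recv, &srvpkt, &newver, &oldver,
                   &denied, &badlen, &badauth, &rate, &extra) == 12)
            printf("MJD %ld, %.3f s past midnight: %ld packets last hour\n",
                   mjd, sec, recv);
        return 0;
    }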
Re: [ntp:questions] very slow convergence of ntp to correct time.
Eric [EMAIL PROTECTED] wrote: Then there is the MAC cache in your switches, which generally purge after 1-5 minutes. This can often be adjusted higher, but that can sometimes cause issues for others when they are reconfiguring part of the network.

I suppose if STP gets involved, but just on its own, the forwarding table in a switch being aged should only mean that the next frame to that MAC will go out all (enabled) ports on the switch until that MAC is seen again as a source. That shouldn't affect timing really. rick jones -- a wide gulf separates what if from if only these opinions are mine, all mine; HP might not want them anyway... :) feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Re: [ntp:questions] very slow convergence of ntp to correct time.
Dave, The problem Eric describes would seem to be most evident for LAN clients, and on networks where there is already lots of traffic. Recommending 'burst' for this case seems to me (without any experimental evidence or even a read thru the code) to be counter-productive. If a burstsize of 2 would be sufficient to address Eric's problem, I'd be game for adding a 'shortburst' flag to handle that. My thought is that while a 'burstsize N' option would be more flexible, it could be abused too easily. H

In article [EMAIL PROTECTED], David L. Mills [EMAIL PROTECTED] writes: Eric, Many years ago the Proteon routers dropped the first packet after the cache timed out; that was a disaster. That case and the ones you describe are exactly what the NTP burst mode is designed for. The first packet in the burst carves the caches all along the route and back. The clock filter algorithm tosses it out in favor of the remaining packets in the burst. No ICMP is needed or wanted. Dave

Eric wrote: On Sun, 20 Jan 2008 17:50:41 GMT, Unruh [EMAIL PROTECTED] wrote for the entire planet to see: [EMAIL PROTECTED] (David Woolley) writes: In article [EMAIL PROTECTED], Unruh [EMAIL PROTECTED] wrote: snip I would assume that ntp is giving these samples with long round trip very low weight, or even eliminating them. Note: if these spikes are positive, they may be the result of lost ticks.

Don't think so. I think they are 5-10 ms transmission delays. The delays disappear if I run at maxpoll 7 rather than 10, so I suspect the router is forgetting the addresses and taking its own sweet time about finding them if the time between transmissions is many minutes. chrony has a nice feature of being able to send an echo datagram to the other machine if you want (before the ntp packet), to wake up the routers along the way.

There are several related effects here that I have experienced in my NTP network. First is the possible ARP resolution overhead. If the IP addresses of your host and of the destination or default gateway are not passing traffic frequently, the ARP cache in your host or the local router can time out and need to be reloaded on each poll. These reloads can be on the order of 5-10 ms and will affect only one side of the transaction's transmission delay. Unfortunately ARP often uses a 15 minute TTL, and default NTP uses a 17 minute poll interval.

Then there is the whole problem that many routers along the path experience extra overhead on the first packet of a flow. Route table lookups are done by destination IP of course, but the result generally has to be installed into the cache, or FIB, the first time a new source/dest IP pair shows up. This is often a 1-3 ms overhead. And that entry doesn't last forever either. Then there is the MAC cache in your switches, which generally purges after 1-5 minutes. This can often be adjusted higher, but that can sometimes cause issues for others when they are reconfiguring part of the network. Another issue is NATting or stateful firewalls. There is often outbound (or inbound) connection setup time. Without special configuration this often times out before twenty minutes, leading to more asymmetric delay.

I think the suggestion of a pre-poll ICMP echo is kinda interesting. It might be possible to limit the packet TTL to five hops or so, just warming up your side of the network. It might also be better to make it a mostly standard UDP NTP packet so it matches whatever rules the intermediate devices are applying (and you want them to remember).
QoS and policy routing are both sensitive to port numbers, and certainly most firewalls are protocol sensitive, so matching the initial packet attributes to the desired high-performance packet attributes would probably help this technique work. To mitigate some of these effects it might not have to be done that often. In many hierarchical network topologies it might serve just to send one extra packet every 3-5 minutes using the same source IP/port that NTP normally uses, to any configured server. And it could still have a limited TTL if desired. That would at least keep the switch and ARP caches fresh and depending on the design, the policy and NAT caches as well. - Eric

-- Harlan Stenn [EMAIL PROTECTED] http://ntpforum.isc.org - be a member!
Re: [ntp:questions] NTP vs chrony comparison (Was: oscillations in ntp clock synchronization)
Maarten, Maybe I didn't make myself clear. The case in question is when the intrinsic frequency error of the computer clock is greater than 500 PPM, in which case the discipline loop cannot compensate for the error. The result is a systematic time offset error that cannot be driven to zero. This has nothing to do with the initial offset as you suggest. Dave

Maarten Wiltink wrote: Unruh [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] David L. Mills wrote: Unless the computer clock intrinsic frequency error is huge, the only time the 500-PPM clamp kicks in is with a 100-ms step transient and poll interval 16 s. The loop still works if it hits the stops; it just can't drive the offset to zero. [...]

Why can't it drive the offset to zero? 100 ms should take about 5 min (if it were always 500 PPM, but the loop would make it take longer).

That would presumably be in the case of 'huge intrinsic frequency error'. Groetjes, Maarten Wiltink
Re: [ntp:questions] very slow convergence of ntp to correct time.
Eric [EMAIL PROTECTED] wrote: You are probably right about the MAC cache miss not affecting timing. You are the resident switch guru here.

Scary thought :) Of course, different manufacturers may have different methods of detecting the cache miss and recovering from that, so it would be hard to eliminate that effect from consideration entirely. It's the smallest effect of all the ones I've dealt with.

Just to be certain, you are talking about MACs being aged out of a switch's forwarding tables right? I interpreted it that way based on the previous text discussing ARP caches. rick jones -- The glass is neither half-empty nor half-full. The glass has a leak. The real question is Can it be patched? these opinions are mine, all mine; HP might not want them anyway... :) feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Re: [ntp:questions] very slow convergence of ntp to correct time.
On 2008-01-28, David L. Mills [EMAIL PROTECTED] wrote: Eric wrote: [---=| TOFU protection by t-prot: 72 lines snipped |=---] That case and the ones you describe are exactly what the NTP burst mode is designed for. The first packet in the burst carves the caches all along the route and back. The clock filter algorithm tosses it out in favor of the remaining packets in the burst. No ICMP is needed or wanted.

Burst sends 8x packets to the remote time server at each poll interval. This greatly increases the load posed by any one client. Perhaps it may be useful to allow the user to specify a smaller number of packets. -- Steve Kostecke [EMAIL PROTECTED] NTP Public Services Project - http://support.ntp.org/
Re: [ntp:questions] very slow convergence of ntp to correct time.
On Mon, 28 Jan 2008 19:19:12 +0000, David L. Mills [EMAIL PROTECTED] wrote for the entire planet to see: Eric, Many years ago the Proteon routers dropped the first packet after the cache timed out; that was a disaster. That case and the ones you describe are exactly what the NTP burst mode is designed for. The first packet in the burst carves the caches all along the route and back. The clock filter algorithm tosses it out in favor of the remaining packets in the burst. No ICMP is needed or wanted. Dave

I agree about ICMP. UDP would be better. And BURST / IBURST are nice, but conventional wisdom has it that BURST really shouldn't be used towards servers that you don't administer, and IBURST will of course not handle the ongoing case.

In considering this more, I think a great option or tinker value would be one that simply sends an extra packet, rather than eight of them, and only if the previous poll for that association was sent more than x seconds ago. In other words, as long as the poll value is say 7 or less, nothing new is needed. When the poll exceeds 7, then ten seconds before a poll is due, an explorer poll is sent (and any response would likely be discarded). EBURST, or maybe PAVE. - Eric
Re: [ntp:questions] very slow convergence of ntp to correct time.
Brian Utterback [EMAIL PROTECTED] wrote: Rick Jones wrote: Eric [EMAIL PROTECTED] wrote: Then there is the MAC cache in your switches, which generally purges after 1-5 minutes. This can often be adjusted higher, but that can sometimes cause issues for others when they are reconfiguring part of the network.

I suppose if STP gets involved, but just on its own, the forwarding table in a switch being aged should only mean that the next frame to that MAC will go out all (enabled) ports on the switch until that MAC is seen again as a source. That shouldn't affect timing really. rick jones

There is some blurring of device type going on. The problem is not which port to send to, but rather which MAC address to send to. This is more of a problem with routers than switches, but with VLANs and whatever these days, the device in question might be both.

I interpreted Eric's text differently. Since a device acting as a switch is only operating at layer 2, it doesn't do any lookups on what the destination MAC should be. Indeed, a device operating as a router could be doing an ARP lookup, but I ass-u-me-d that was covered by a prior paragraph of Eric's.

In any case, if the needed MAC is not available, there has to be an ARP request and response before the packet can be sent, but this delay is not evident in the return trip for the NTP response packet, introducing an asymmetric delay, the worst thing that can happen to NTP. I reported this problem many years ago and suggested using burst at that time, but thought that it would be overkill and asked for a way to tune it to a fewer number of packets in the burst. Dave was reticent and I was newer to the project then and didn't want to push it. Perhaps it is time.

I'll probably quite easily display my profound NTP ignorance here :) But if there is asymmetric delay stemming from an ARP resolution, won't it affect all the packets in the burst? Unless the transmission time of the burst getting out of NTP is longer than the ARP resolution time, the entire burst is going to be blocked waiting on the ARP resolution. Now, if this burst was really send a couple; wait for a reply; send a couple more, then one might ass-u-me (I do love that spelling :) that the couple more didn't have ARP-induced asymmetry. rick jones

I probably tweak on switch vs router much the same way an NTP person tweaks on accuracy vs precision :) -- web2.0 n, the dot.com reunion tour... these opinions are mine, all mine; HP might not want them anyway... :) feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Re: [ntp:questions] very slow convergence of ntp to correct time.
On Mon, 28 Jan 2008 21:39:17 + (UTC), Rick Jones [EMAIL PROTECTED] wrote for the entire planet to see: Eric [EMAIL PROTECTED] wrote: Of course, different manufacturers may have different methods of detecting the cache miss and recovering from that, so it would be hard to eliminate that effect from consideration entirely. It's the smallest effect of all the ones I've dealt with. Just to be certain, you are talking about MAC's being aged out of a switch's forwarding tables right? I interpreted it that way based on the previous text discussing ARP caches.

Yup. And I see that in the simple case the packet just floods, isn't delayed on its original path/port, and the MAC cache update is handled overlapped in time with the packet transfer. But, there may be more complicated cases; you mentioned STP, and of course flooding causes its own delays to some degree. Then it might be that the switch firmware takes a slow path on the cache miss, causes an interrupt, gets scheduled into a timeslice, updates the MAC cache, and then redrives the packet forwarding process. Not ideal, if that ever happens. - Eric
Re: [ntp:questions] very slow convergence of ntp to correct time.
Eric wrote: I'm pleased to know I've provoked some new thoughts. If I understand your post, burst mode was intended to get enough (lousy) samples into and through the clock filters to allow for initial sync. Once the pipeline is loaded no more extra polls are needed.

That's iburst, not burst.
Re: [ntp:questions] very slow convergence of ntp to correct time.
Eric, Good suggestion. In either burst mode the code sends a single packet at each poll interval, but sends the remaining packets only after receiving a response; packets in the burst are sent at 2-s intervals. The most cautious can set the headway to 512 s, in which case a single packet is sent at that interval or less, but two packets at 1024 s only upon receiving a response for the first. Burst mode is of course not intended for busy servers, much less the national standards servers. It is intended for paths involving lossy, low speed nets with poll intervals of 1024 s or more. The young folks among us might not remember (or even be alive) when the Internet was new and paths to Canada had delays up to several seconds and loss rates up to ten percent.

However, you give me an idea. Why not shut down the burst when the clock filter delivers the first sample? Gotta think about that. Dave

Eric wrote: On Mon, 28 Jan 2008 19:19:12 +0000, David L. Mills [EMAIL PROTECTED] wrote for the entire planet to see: Eric, Many years ago the Proteon routers dropped the first packet after the cache timed out; that was a disaster. That case and the ones you describe are exactly what the NTP burst mode is designed for. The first packet in the burst carves the caches all along the route and back. The clock filter algorithm tosses it out in favor of the remaining packets in the burst. No ICMP is needed or wanted. Dave

I agree about ICMP. UDP would be better. And BURST / IBURST are nice, but conventional wisdom has it that BURST really shouldn't be used towards servers that you don't administer, and IBURST will of course not handle the ongoing case. In considering this more, I think a great option or tinker value would be one that simply sends an extra packet, rather than eight of them, and only if the previous poll for that association was sent more than x seconds ago. In other words, as long as the poll value is say 7 or less, nothing new is needed. When the poll exceeds 7, then ten seconds before a poll is due, an explorer poll is sent (and any response would likely be discarded). EBURST, or maybe PAVE. - Eric
Re: [ntp:questions] very slow convergence of ntp to correct time.
The burst is sent at 1 second intervals. There should be lots of time for all the switches and routers to get their act in gear.

Ah, well chalk that one up to me being picky (perhaps even wrong :) about network terminology then :) I always think of a burst as a series of packets sent back-to-back. insert suitable Emily Litella quote here rick jones -- No need to believe in either side, or any side. There is no cause. There's only yourself. The belief is in your own precision. - Jobert these opinions are mine, all mine; HP might not want them anyway... :) feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Re: [ntp:questions] NTP Statistics
In article [EMAIL PROTECTED], Steve Pearson [EMAIL PROTECTED] writes: Steve: Hi, I am using ntp v4.2.0 and have a question about the system statistics interpretation on my NTP server. The ntp page on monitoring options lists 11 'system stat' fields as follows: http://www.eecis.udel.edu/~mills/ntp/html/monopt.html

Those are Dave's pages, and they reflect the latest -dev code. For now, the best thing for you to do is get the source code for the version you are running and look at the code. The NTP Forum has projects listed to improve the documentation: http://ntpforum.isc.org/Main/ForumProject3 http://ntpforum.isc.org/Main/ForumProject4

Another thing: the NTP Forum will be helping support.ntp.org offer web-searchable documentation for different versions of NTP. A significant purpose of the NTP Forum is to find the places where NTP is giving you headaches and then making those headaches go away. I'm eager to work with folks to get their companies signed up as institutional members in the NTP Forum so we can get rid of the pain and have significantly better lives where NTP is concerned. -- Harlan Stenn [EMAIL PROTECTED] http://ntpforum.isc.org - be a member!
Re: [ntp:questions] very slow convergence of ntp to correct time.
Hal Murray wrote: I'll probably quite easily display my profound NTP ignorance here :) But if there is asymmetric delay stemming from an ARP resolution, won't it affect all the packets in the burst? Unless the transmission time of the burst getting out of NTP is longer than the ARP resolution time, the entire burst is going to be blocked waiting on the ARP resolution.

The burst is sent at 1 second intervals. There should be lots of time for all the switches and routers to get their act in gear.

I thought that burst sent eight packets two seconds apart at each poll interval. It's not appropriate for most situations. It was designed for systems making infrequent dialup connections, like twice or three times daily.
Re: [ntp:questions] very slow convergence of ntp to correct time.
On Mon, 28 Jan 2008 22:09:11 +0000, David L. Mills [EMAIL PROTECTED] wrote for the entire planet to see: snip However, you give me an idea. Why not shut down the burst when the clock filter delivers the first sample? Gotta think about that. Dave

Hi Dave - I'm pleased to know I've provoked some new thoughts. If I understand your post, burst mode was intended to get enough (lousy) samples into and through the clock filters to allow for initial sync. Once the pipeline is loaded no more extra polls are needed. But the rest of this sub-thread was about poll intervals that get so large that the intervening equipment forgets about the flow and always, from then on, gives lousy performance on the one and only poll in that interval. I guess we could kill two birds with one stone and shut down burst as you suggest, until the interval gets longer, when it could make a reappearance, perhaps as only a pair of packets. - Eric
Re: [ntp:questions] very slow convergence of ntp to correct time.
Eric wrote: On Mon, 28 Jan 2008 22:09:11 +0000, David L. Mills [EMAIL PROTECTED] wrote for the entire planet to see: snip However, you give me an idea. Why not shut down the burst when the clock filter delivers the first sample? Gotta think about that. Dave

Hi Dave - I'm pleased to know I've provoked some new thoughts. If I understand your post, burst mode was intended to get enough (lousy) samples into and through the clock filters to allow for initial sync. Once the pipeline is loaded no more extra polls are needed.

I think you are confusing iburst and burst. Iburst is used at startup to fill the pipeline and get a fast startup. Following that initial burst, a single request packet is sent at each poll interval. Burst mode is used in situations where a system connects to a server at intervals measured in hours; e.g. two to four times per day. The samples are not necessarily lousy, they are just obtained infrequently. Eight samples fill the pipeline and satisfy the filter.
Re: [ntp:questions] strange behaviour of ntp peerstats entries.
Unruh, It would seem self evident from the equations that minimizing the delay variance truly does minimize the offset variance. Further evidence of that is in the raw versus filtered offset graphs in the architecture briefings. If nothing else, the filter reduces the variance by some 10 dB. More to the point, emphasis added, the wedge scattergrams show just how good the filter can be. It selects points near the apex of the wedge; the others don't matter. You might argue the particular clock filter algorithm could be improved, but the mission in any case is to select the points at or near the apex.

While the authors might not have realized it, the filter method you describe is identical to Cristian's Probabilistic Clock Synchronization (PCS) method described in the literature some years back. The idea is to discard the outlier delays beyond a decreasing threshold. In other words, the tighter the threshold, the more outliers are tossed out, so you strike a balance. I argued then and now that it is better to select the best from among the samples rather than to selectively discard the outliers.

There may be merit in an argument that says the points along the limbs of the wedge are being ignored. In principle, these points can be found using a selective filter that searches for an offset/delay ratio of 0.5, which in fact is what the huff-n'-puff filter does. To do this effectively you need to know the baseline propagation delay, which is also what the huff-n'-puff filter does. Experiments doing this with symmetric delays, as against the asymmetric delays the huff-n'-puff filter was designed for, were inconclusive. Dave

Unruh wrote: snip Oh yes, popcorn suppression is important. I agree. But the filter goes well beyond that. My reaction is that on the one hand people keep saying how important net load is, and that one does not want to use poll intervals that are much smaller than 8 or 10, and on the other hand, ntpd throws away 80-90% of the data collected. Reminds me of the story of Saul, king of the Israelites, whose army was besieged, and he mentioned that he was thirsty. A few of his soldiers risked everything to get through the enemy lines and bring him water. He was so impressed that he poured it all out on the ground, in tribute to their courage. I have always found that story an incredible insult to the bravery instead. The procedure does drastically reduce the variance of the delay, but does not do much for the variance of the offset, which is of course what is important. Just to bring up chrony again, it uses both a suppression rule, where round trips greater than say 1.5 times the minimum are discarded, and a weighting, where data is weighted by some power of the inverse of the delay. snip
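For readers who have not met it, the huff-n'-puff correction Dave describes can be sketched as below: track the baseline (minimum) delay over a window, then assume any excess delay is one-sided and shift the offset toward zero by half of it, since points on the limbs of the wedge have an offset-to-excess-delay ratio of 0.5. Window management is omitted and the function name is made up; this is an illustration of the idea, not ntpd's implementation:

    /* Correct a measured offset for asymmetric excess delay, given the
     * baseline (minimum) delay observed over some recent window. */
    double huffpuff(double offset, double delay, double min_delay)
    {
        double excess = delay - min_delay;
        if (excess <= 0)
            return offset;          /* at the apex of the wedge; keep as is */
        return offset > 0 ? offset - excess / 2
                          : offset + excess / 2;
    }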
Re: [ntp:questions] very slow convergence of ntp to correct time.
Steve, You might have missed my message about rate control on the hackers list. The average headway for all packets, including those in a burst, is strictly controlled at 16 s. So, 1 packet in a burst at 16 s, 2 at 32 s, 4 at 64 s, and 8 at 128 s and higher. The default average headway can be set by a configuration command. The scheme is specifically designed for long, noisy Internet paths and large poll intervals where the clock filter is most effective, and also for cases involving dialup links with highly variable call setup delays. The fact that it trounces ARP caches is a secondary benefit. ICMP pings will not work to our campus machines from outside. ICMP request messages are dropped by the ingress router. Dave

Steve Kostecke wrote: On 2008-01-28, David L. Mills [EMAIL PROTECTED] wrote: Eric wrote: [---=| TOFU protection by t-prot: 72 lines snipped |=---] That case and the ones you describe are exactly what the NTP burst mode is designed for. The first packet in the burst carves the caches all along the route and back. The clock filter algorithm tosses it out in favor of the remaining packets in the burst. No ICMP is needed or wanted.

Burst sends 8x packets to the remote time server at each poll interval. This greatly increases the load posed by any one client. Perhaps it may be useful to allow the user to specify a smaller number of packets.
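Dave's rate rule reduces to simple arithmetic: with an average headway of 16 s, a burst of N packets is permitted only when the poll interval is at least N x 16 s, capped at 8. A sketch of that bookkeeping, not the actual ntpd rate-control code:

    /* Packets allowed per burst under a 16 s average headway. */
    #include <stdio.h>

    int main(void)
    {
        const int headway = 16;                  /* average headway, seconds */
        const int polls[] = { 16, 32, 64, 128, 1024 };
        for (int i = 0; i < 5; i++) {
            int nburst = polls[i] / headway;
            if (nburst > 8)
                nburst = 8;                      /* bursts are capped at 8 */
            printf("poll %4d s -> up to %d packets per burst\n",
                   polls[i], nburst);
        }
        return 0;   /* 1 at 16 s, 2 at 32 s, 4 at 64 s, 8 at 128 s and up */
    }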
Re: [ntp:questions] very slow convergence of ntp to correct time.
On Mon, 28 Jan 2008 17:44:08 -0500, Richard B. Gilbert [EMAIL PROTECTED] wrote for the entire planet to see: I thought that burst sent eight packets two seconds apart at each poll interval. It's not appropriate for most situations. It was designed for systems making infrequent dialup connections, like twice or three times daily.

My confusion. IBURST for the initial loading of the buffer. BURST for the very, very infrequent connection, to reload the entire buffer each time.

But what about the idea that IBURST is nice for fast startup, BURST is helpful if there hasn't been a poll for a very, very long time, and now the new idea for an explorer packet (only one extra) that would be nice to smooth the network path when the polling interval goes over a couple of minutes? It turns each of them into virtually the same case, classified by the polling interval currently in effect. - Eric
Re: [ntp:questions] very slow convergence of ntp to correct time.
Eric, There are actually two burst modes: IBURST when the server is unreachable and BURST when it is. They are independent of each other and both can be used at the same time. Currently, IBURST uses 6 packets, as that is a couple more than needed to pass the distance threshold and synchronize the clock. This actually is recommended; the following packets, whether burst or not, are delayed so that the average headway does not exceed the specified threshold, by default 16 s. The BURST mode also obeys the headway restrictions, but is intended to de-jitter in the cases I mentioned.

What set off my bell in response to your remark was an interesting observation when watching the clock filter operate. Start the daemon with a -d flag and watch the clock_filter and local_clock traces. Notice that there are often several samples discarded as not younger than the last used sample. This is a normal situation; however, it reveals that the probability of using another sample just after using one is relatively low. In other words, when you find a sample you might as well give up and wait for the next burst. This needs to be confirmed. Dave

Eric wrote: On Mon, 28 Jan 2008 22:09:11 +0000, David L. Mills [EMAIL PROTECTED] wrote for the entire planet to see: snip However, you give me an idea. Why not shut down the burst when the clock filter delivers the first sample? Gotta think about that. Dave

Hi Dave - I'm pleased to know I've provoked some new thoughts. If I understand your post, burst mode was intended to get enough (lousy) samples into and through the clock filters to allow for initial sync. Once the pipeline is loaded no more extra polls are needed. But the rest of this sub-thread was about poll intervals that get so large that the intervening equipment forgets about the flow and always, from then on, gives lousy performance on the one and only poll in that interval. I guess we could kill two birds with one stone and shut down burst as you suggest, until the interval gets longer, when it could make a reappearance, perhaps as only a pair of packets. - Eric
Re: [ntp:questions] NTP Statistics
Steve, The best place to check the data is in the ntp_util.c file, record_sys_stats() routine. I recently added another stat, but you might not be using the most recent version. The time since startup is in hours. The packets received are the total number of packets received. The server packets received are in response to packets sent from an association on your machine. There are so many little trails in and out of the machine, like control/monitor packets, etc., and so many little ways a packet can be dropped, that the counters might not catch each and every wee thing. Dave

Steve Pearson wrote: Hi, I am using ntp v4.2.0 and have a question about the system statistics interpretation on my NTP server. The ntp page on monitoring options lists 11 'system stats' fields as follows: http://www.eecis.udel.edu/~mills/ntp/html/monopt.html MJD date, time past midnight, time since restart, packets received last hour, server packets received last hour, current version packets last hour, previous version packets last hour, access denied packets last hour, bad length or format packets last hour, bad authentication packets last hour, rate exceeded packets last hour.

A sample of my daily filegen output is as follows (12 fields):

54493 514.622 117 12 3 12 0 0 0 0 0 0
54493 4118.319 118 14 2 14 0 0 0 0 0 0
54493 7722.012 119 13 3 13 0 0 0 0 0 0

It is critical for me to understand the meaning of each of the stats to properly monitor my NTP deployment. Can someone please point me to more detailed descriptions, or maybe just confirm and comment on my guesses below. My interpretation of the stats (for my first line of output listed above) is:

54493 - MJD date
514.622 - UTC time past midnight in seconds
117 - time since restart (in hours? or is this just a record count?)
12 - packets received last hour (NTP request packets from clients last hour)
3 - server packets received last hour (packets from other servers??)
12 - current version packets last hour (NTP request packets from clients using the same version of NTP)
0 - previous version packets last hour
0 - access denied packets last hour (not allowed to synchronize with me???)
0 - bad length or format packets last hour
0 - bad authentication packets last hour (bad MD5 check??)
0 - rate exceeded packets last hour (exceeded the minimum poll rate or some such?)
0 - extra field not described on the web page

Most critical for me is that I understand that packets received last hour is really a count of NTP requests from clients. I want to use this to get a rough idea of the load on my server, for scaling and monitoring. thanks, Steve
Re: [ntp:questions] very slow convergence of ntp to correct time.
Hal, Not any more. Current NTPv4 sends the burst at 2-s intervals, mainly to coordinate with Autokey opportunities and reduce the total number of packets. Dave

Hal Murray wrote: I'll probably quite easily display my profound NTP ignorance here :) But if there is assymetric delay stemming from an ARP resolution, won't it affect all the packets in the burst? Unless the tranmission time of the burst getting out of NTP is the ARP resolution time, the entire burst is going to be blocked waiting on the ARP resolution. The burst is sent at 1 second intervals. There should be lots of time for all the switches and routers to get their act in gear.
Re: [ntp:questions] strange behaviour of ntp peerstats entries.
David L. Mills [EMAIL PROTECTED] writes:

Unruh, It would seem self evident from the equations that minimizing the delay variance truly does minimize the offset variance. Further evidence of that is in the raw versus filtered offset graphs in the architecture briefings. If nothing else, the filter reduces the variance by some 10 dB. More to the point, emphasis added, the wedge scattergrams show just

I guess then I am confused, because my data does not support that. While the delay variance IS reduced, the offset variance is not. The correlation between delay and offset IS reduced by a factor of 10, but the clock variance is reduced not at all. Here are the results from one day gathered from one clock (I had ntp not only print out the peer-offset and peer-delay as it does in record_peer_stats, but also the p_offset and p_del, the offset and delays calculated for each packet). I also threw out the outliers (for some reason the system would all of a sudden have packets which were 4 ms round trip, rather than 160 usec; these popcorn spikes are clearly bad). The difference between the variance as calculated from the peer-offset values and the p_offset values was:

.5995 (p_offset std dev, with delay spikes greater than .0003 eliminated)
.6017 (peer-offset std dev)
.07337 (p_delay std dev, with the greater-than-.0003 spikes removed)
.05489 (peer-delay std dev)

(Note that if those popcorn spikes had not been removed, the std dev of the p_offset and p_delay would have been much larger.) I.e., it makes no difference at all to the offset std dev, but a significant one to the delay. (Yes, the precision I quote the numbers at is far greater than the accuracy.) This is throwing away 83% of the data in the peer- case. Note that this is one machine on one day, etc., and well after the startup transients had disappeared.

how good the filter can be. It selects points near the apex of the wedge; the others don't matter. You might argue the particular clock filter algorithm could be improved, but the mission in any case is to select the points at or near the apex. While the authors might not have realized it, the filter method you describe is identical to Cristian's Probabilistic Clock Synchronization (PCS) method described in the literature some years back. The idea is

I have no idea if Curnow knew that. The majority of his code was written 10 years ago, not recently. He uses only the inverse of the delay as the weights, I believe, with a user adjustable parameter to throw away delays which are too large.

to discard the outlier delays beyond a decreasing threshold. In other words, the tighter the threshold, the more outliers are tossed out, so you strike a balance. I argued then and now that it is better to select the best from among the samples rather than to selectively discard the outliers. There may be merit in an argument that says the points along the limbs of the wedge are being ignored. In principle, these points can be found using a selective filter that searches for an offset/delay ratio of 0.5, which in fact is what the huff-n'-puff filter does. To do this effectively you need to know the baseline propagation delay, which is also what the huff-n'-puff filter does. Experiments doing this with symmetric delays, as against the asymmetric delays the huff-n'-puff filter was designed for, were inconclusive.
But from what I see of the code, the huff-n'-puff correction occurs after 80% have already been discarded by the clock_filter.

If data were cheap (and I think that in most cases today it is) then throwing away 80% is fine. There is lots more out there. But this profligacy in the treatment of the data sits uncomfortably with the competing claim that data is precious -- you should never use maxpoll less than 7, you should bother the ntp servers as little as possible. That makes the data precious. You cannot simply go out and collect all you want. Then throwing it away seems a bad idea to me.

Dave Unruh wrote: snip Oh yes, popcorn suppression is important. I agree. But the filter goes well beyond that. My reaction is that on the one hand people keep saying how important net load is, and that one does not want to use poll intervals that are much smaller than 8 or 10, and on the other hand, ntpd throws away 80-90% of the data collected. Reminds me of the story of Saul, king of the Israelites, whose army was besieged, and he mentioned that he was thirsty. A few of his soldiers risked everything to get through the enemy lines and bring him water. He was so impressed that he poured it all out on the ground, in tribute to their courage. I have always found that story an incredible insult to the bravery instead. The procedure does drastically reduce the variance of the delay, but does not do much for the variance of the offset, which is of course what is important. Just to bring up chrony again, it uses both a suppression where round trips greater than say 1.5
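The comparison described above can be reproduced mechanically: compute the standard deviation of the offsets with and without discarding samples whose delay exceeds a popcorn threshold. The .0003 s threshold is the one quoted in the post; everything else below is generic statistics, with made-up names, and no relation to ntpd or chrony source:

    /* Std dev of offsets, keeping only samples with delay under a cap. */
    #include <math.h>

    static double stddev(const double *x, int n)
    {
        double mean = 0, ss = 0;
        for (int i = 0; i < n; i++) mean += x[i];
        mean /= n;
        for (int i = 0; i < n; i++) ss += (x[i] - mean) * (x[i] - mean);
        return sqrt(ss / n);
    }

    /* Filter paired (offset, delay) samples by delay threshold, e.g.
     * max_delay = 0.0003, then report the offset scatter that remains. */
    double filtered_offset_stddev(const double *offset, const double *delay,
                                  int n, double max_delay)
    {
        double kept[1024];
        int m = 0;
        for (int i = 0; i < n && m < 1024; i++)
            if (delay[i] < max_delay)
                kept[m++] = offset[i];
        return m > 0 ? stddev(kept, m) : 0.0;
    }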