OK, I thought this was a one-off problem, but ntpd has lost sync again, about four days after a restart. It never regains sync.

It seems to start with the system clock drifting away from the PPS lock. The oscillations from the corrections then become too large and the whole thing blows up.


Here's the current configuration for version 4.2.7p236:

server  0.us.pool.ntp.org             minpoll 9 iburst
server  1.us.pool.ntp.org             minpoll 9 iburst
server  0.north-america.pool.ntp.org  minpoll 9 iburst
server  ntp1.gatech.edu               prefer minpoll 9
server  rolex.usg.edu                 minpoll 9
server  127.127.22.0                  minpoll 2 maxpoll 4
fudge   127.127.22.0                  time1 +0.000 flag2 1 flag3 1 refid PPS
server  127.127.28.0                  minpoll 7 noselect
fudge   127.127.28.0                  time1 -0.6 refid GPSD


The peer list about a day after the initial upset:

     remote          refid      st t when poll reach    delay   offset   jitter
===============================================================================
x127.127.22.0   .PPS.            0 l    -   16   377    0.000  -465.49  355.933
 127.127.28.0   .GPSD.           0 l    -  128   377    0.000  -208986  2833.87
 207.7.148.214  216.218.254.202  2 u    -  512   377  1045.07  -209713  11784.0
 72.14.179.211  127.67.113.92    2 u    -  512   377  1029.80  -201710  6559.37
 173.255.224.22 128.4.1.1        2 u  245  512   377  919.628  -202629  7684.05
 130.207.165.28 130.207.244.240  2 u    -  512   377  994.543  -204125  7778.28
 131.144.4.10   65.212.71.102    2 u   23  512   377  1000.21  -203648  7687.63

Note that the PPS offset is swinging wildly, which is not visible in this static snapshot.

ntpq associations:
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1  4560  912a   yes   yes  none falsetick    sys_peer  2
  2  4561  9014   yes   yes  none    reject   reachable  1
  3  4562  9014   yes   yes  none    reject   reachable  1
  4  4563  9034   yes   yes  none    reject   reachable  3
  5  4564  9014   yes   yes  none    reject   reachable  1
  6  4565  904a   yes   yes  none    reject    sys_peer  4
  7  4566  9014   yes   yes  none    reject   reachable  1
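
The hex status column can be unpacked by hand. Below is a minimal Python sketch, assuming the usual ntpq peer status word layout (5 flag bits, a 3-bit select code, a 4-bit event count and a 4-bit last-event code). The function name and lookup tables are illustrative and cover only the codes that appear in the table above; anything else is reported as a raw number.

```python
# Assumed layout of the 16-bit peer status word:
#   bits 15-11 status flags, bits 10-8 select code,
#   bits 7-4 event count,    bits 3-0 last event code.
# Only the values seen in this post are named; the rest fall back to numbers.
FLAG_NAMES = {0x10: "conf", 0x02: "reach"}          # within the 5-bit flag field
SELECT_NAMES = {0: "sel_reject", 1: "sel_falsetick"}
EVENT_NAMES = {0x4: "reachable", 0xA: "sys_peer"}


def decode_peer_status(word):
    """Split a peer status word into flags, select code, count and last event."""
    flags = (word >> 11) & 0x1F
    select = (word >> 8) & 0x07
    count = (word >> 4) & 0x0F
    event = word & 0x0F
    return {
        "flags": [name for bit, name in FLAG_NAMES.items() if flags & bit],
        "select": SELECT_NAMES.get(select, "select_%d" % select),
        "count": count,
        "last_event": EVENT_NAMES.get(event, "event_%d" % event),
    }


if __name__ == "__main__":
    for status in (0x912A, 0x9014, 0x9034, 0x904A):
        print("%04x" % status, decode_peer_status(status))
```

For 912a this gives conf, reach, sel_falsetick, 2 events, last event sys_peer, which agrees with the rv output for association 4560. Read this way, "sys_peer" in the last_event column records the most recent event for that association, not its current selection state.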

rv 4560 (first sys_peer):
 associd=4560 status=912a conf, reach, sel_falsetick, 2 events, sys_peer,
 srcadr=PPS(0), srcport=123, dstadr=127.0.0.1, dstport=123, leap=00,
 stratum=0, precision=-20, rootdelay=0.000, rootdisp=0.000, refid=PPS,
 reftime=d2d76400.c9b870fd  Sat, Feb  4 2012  8:00:00.787,
 rec=d2d76401.ffffffff  Sat, Feb  4 2012  8:00:02.000, reach=377,
 unreach=0, hmode=3, pmode=4, hpoll=4, ppoll=4, headway=0, flash=00 ok,
 keyid=0, offset=259.524, delay=0.000, dispersion=4.956, jitter=444.467,
 filtdelay=     0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00,
 filtoffset=  259.52  344.53  419.52  474.51 -430.48 -335.49 -265.48 -185.49,
 filtdisp=      4.74    4.98    5.22    5.47    5.70    5.94    6.18    6.42

rv 4565 (second sys_peer):
 associd=4565 status=904a conf, reach, sel_reject, 4 events, sys_peer,
 srcadr=ntp1.gatech.edu, srcport=123, dstadr=10.0.0.21, dstport=123,
 leap=00, stratum=2, precision=-20, rootdelay=0.565, rootdisp=24.597,
 refid=130.207.244.240,
 reftime=d2d7609d.0646422f  Sat, Feb  4 2012  7:45:33.024,
 rec=d2d76271.00c7dd3a  Sat, Feb  4 2012  7:53:21.003, reach=377,
 unreach=0, hmode=3, pmode=4, hpoll=9, ppoll=9, headway=46,
 flash=400 peer_dist, keyid=0, offset=-204125.520, delay=994.543,
 dispersion=16.941, jitter=7778.280,
 filtdelay=   997.29  999.05  994.54  996.13  994.70  994.38  977.68  995.78,
 filtoffset= -209351 -206700 -204125 -201435 -198758 -196080 -193475 -190882,
 filtdisp=      0.08    8.07   15.83   23.94   32.01   40.08   47.91   55.76
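
The filtoffset values for ntp1.gatech.edu can be turned into a rough drift rate. The sketch below is plain arithmetic on the eight values above, under two assumptions that the output does not state: the filter register is listed newest first, and consecutive samples are one poll interval apart (hpoll=9, so 512 s). The function name is illustrative.

```python
# filtoffset for ntp1.gatech.edu in milliseconds, copied from the rv output.
# Assumption: the first entry is the most recent sample.
FILTOFFSET_MS = [-209351, -206700, -204125, -201435,
                 -198758, -196080, -193475, -190882]

# Assumption: one sample per poll, with hpoll=9 giving 2**9 seconds.
POLL_S = 2 ** 9


def mean_drift_ppm(offsets_ms, poll_s):
    """Average change in offset per second of elapsed time, in parts per million."""
    # Newest minus oldest, spread over the number of intervals between them.
    total_change_ms = offsets_ms[0] - offsets_ms[-1]
    intervals = len(offsets_ms) - 1
    ms_per_second = total_change_ms / intervals / poll_s
    return ms_per_second * 1000.0  # 1 ms/s == 1000 ppm


if __name__ == "__main__":
    print("mean drift: %.0f ppm" % mean_drift_ppm(FILTOFFSET_MS, POLL_S))
```

Under those assumptions the offset moves by about -2.6 s per poll, or roughly -5150 ppm. That is around ten times the 500 ppm frequency error that ntpd is designed to correct, which would fit with it never regaining sync on its own.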


I can provide graphs of the offset, dispersion and skew for any of the peers if anyone wants them. The GPS itself has been ticking just fine, with no apparent problems in its signal to the machine. As far as I can tell from the peers files, the reported PPS offset simply shifts suddenly away from its nominal few microseconds. It then swings wildly (like a PID loop in oscillation) until I restart ntpd, after which the system clock stabilizes.
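
To pin down when the sudden shift starts, the peer statistics can be scanned for the first PPS sample that leaves the nominal range. This is a sketch only, assuming the files use the standard peerstats layout (MJD, seconds past midnight, source address, status, offset in seconds, then delay, dispersion and jitter). The 1 ms threshold, the function names and the file path passed in are placeholders, not values taken from this system.

```python
def find_onset(lines, source="127.127.22.0", threshold_s=0.001):
    """Return (mjd, seconds, offset) for the first sample of `source` whose
    absolute offset exceeds threshold_s, or None if there is none.

    Assumes peerstats-style lines: MJD, seconds, address, status, offset, ...
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[2] != source:
            continue  # short line or a different peer
        try:
            mjd = int(fields[0])
            seconds = float(fields[1])
            offset = float(fields[4])
        except ValueError:
            continue  # skip anything that does not parse as numbers
        if abs(offset) > threshold_s:
            return mjd, seconds, offset
    return None


def find_onset_in_file(path, **kwargs):
    """Convenience wrapper: run find_onset over a statistics file on disk."""
    with open(path) as handle:
        return find_onset(handle, **kwargs)
```

Calling find_onset_in_file on each day's file in turn gives the day and second of the first excursion, which can then be compared against syslog and the gpsd logs for the same moment.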

The system sits quietly in a corner of the room. It has no duties other than running ntpd and gpsd. All monitoring runs on other systems: ntpd is polled remotely with ntpq, and gpsd status is queried remotely and compiled there. The oscillations start after a few days, but no obvious cron jobs are running at the times they begin. If there is anything I can do to instrument ntpd further, I'm happy to do it and see if I can catch the problem.