Re: NTPsec panic and abort

2022-03-18 Thread Hal Murray via devel
Interesting.  Thanks.

That exit is what happens if you try to adjust the time by too large a step.  
It's just a sanity check -- assuming that exiting ntpd is better than making a 
large adjustment.

I forget the exact default for that limit (the panic threshold; I think it is 
1000 seconds).  You can change it via the config file with "tinker panic".  
You can bypass that check for the first adjustment with a -g on the command 
line.
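
As a rough sketch of the check described above (not NTPsec's actual code): the function name and arguments are made up for illustration, and the 1000 second threshold is an assumed default that is not stated in this thread.

```python
# Hypothetical model of the panic sanity check; names are illustrative only.
PANIC_THRESHOLD_S = 1000.0  # assumption: default panic threshold in seconds


def should_exit(offset_s, is_first_adjustment, g_flag,
                panic_s=PANIC_THRESHOLD_S):
    """Return True if the daemon would give up rather than adjust the clock."""
    if abs(offset_s) <= panic_s:
        return False  # offset is within the sanity limit
    if g_flag and is_first_adjustment:
        return False  # -g waives the check, but only for the first adjustment
    return True


# The offset from the report arrived hours after startup, so -g would not help.
print(should_exit(-604800.0, is_first_adjustment=False, g_flag=True))
```

This is why -g alone would not have prevented the exit in the report: the bad sample arrived long after the first adjustment.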

>> Mar 18 05:10:10 gw1 ntpd[2200]: CLOCK: Panic: offset too big: -604800.000

What time zone is your logging using?
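
One way to line the syslog stamp up with the UTC times in clockstats, as a sketch. It assumes the log uses local time at UTC+05:30 (the offset in the Date header of the forwarded report) and that the year is 2022; neither is confirmed in the thread.

```python
from datetime import datetime, timedelta, timezone

# Assumption: syslog is in local time at UTC+05:30 (taken from the Date
# header of the forwarded message, not confirmed by the reporter).
LOCAL = timezone(timedelta(hours=5, minutes=30))


def syslog_to_utc(stamp, year=2022):
    """Convert a 'Mon DD HH:MM:SS' syslog stamp to an aware UTC datetime."""
    local = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return local.replace(tzinfo=LOCAL).astimezone(timezone.utc)


panic_utc = syslog_to_utc("Mar 18 05:10:10")
print(panic_utc.isoformat())
```

Under that assumption the panic was logged at 23:40:10 UTC on 17 March, 16 seconds after the 23:39:54 sample in the clockstats excerpt of the forwarded report.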

>> 59655 86030.616 NMEA(0) $GPRMC,235350,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__

> Note the spurious "100322" date that is 1 week in the past.

I don't see a "100322".  I see 17 rather than 10.

I'm assuming you copied the wrong chunk from the clockstats file.


> I have never come across such an ntpd abort before. This is the first time
> I'm seeing it.

Mostly, GPS units don't lie.

> Can this condition be handled in any other way, so that the service doesn't
> terminate?

How should ntpd decide if the GPS is lying or the time really is off by a week?


Do you have any other servers in your config file?

If there are several working servers, they should out-vote a lying GPS.
But the GPS has a prefer so I'm not sure what would happen.
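
To illustrate what "out-vote" means here, a toy majority check. It is not ntpd's selection algorithm and it deliberately ignores the prefer keyword; the server names, offsets, and the 0.128 s tolerance are all made-up values for the example.

```python
def agreeing_majority(offsets, tolerance_s=0.128):
    """Return the largest set of sources whose offsets lie within
    tolerance_s of each other, or an empty set if it is not a majority."""
    names = sorted(offsets, key=offsets.get)
    best = []
    for i, low in enumerate(names):
        # Sources at or above 'low' that are still within the tolerance.
        group = [n for n in names[i:]
                 if offsets[n] - offsets[low] <= tolerance_s]
        if len(group) > len(best):
            best = group
    return set(best) if len(best) > len(offsets) / 2 else set()


# Hypothetical sources: three network servers plus the GPS a week off.
sample = {"NMEA(0)": -604800.0, "srv-a": 0.002, "srv-b": -0.001, "srv-c": 0.004}
trusted = agreeing_majority(sample)
print(sorted(trusted))
```

With a single source, as in the reported config, there is nobody to vote against the GPS, so a check like this cannot help.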

You are using the PPS in the NMEA driver.  I don't think that needs the prefer.

I usually run with a separate PPS driver so I get the statistics from the PPS 
driver.  That case does need the prefer.


-- 
These are my opinions.  I hate spam.



___
devel mailing list
devel@ntpsec.org
https://lists.ntpsec.org/mailman/listinfo/devel


FWD: NTPsec panic and abort

2022-03-18 Thread Hal Murray via devel


--- Forwarded Message

Date: Fri, 18 Mar 2022 06:02:51 +0530
From: Mukund Sivaraman 
To: hmur...@megapathdsl.net
Subject: NTPsec panic and abort
Message-ID: 



Hi Hal

I apologize for emailing you directly instead of creating an NTPsec
issue, but I am currently unable to log in to my GitLab account.

I am reporting an ntpd abort/crash. It is from a stock Fedora RPM:

> [muks@gw1 ~]$ rpm -q ntpsec
> ntpsec-1.2.1-4.fc35.x86_64
> [muks@gw1 ~]$

The computer has a Garmin 18x LVC GPS receiver device hooked up to a
serial port, and ntpd's builtin NMEA driver is used to interface with it
directly. It also has a working PPS signal. The relevant ntp.conf config
lines are:

> server 127.127.20.0 mode 1 prefer minpoll 4
> fudge 127.127.20.0 flag1 1 flag2 0 flag3 0 flag4 1 time2 0.5100621

This is how it looks normally (the device's datasheet claims 1us
accuracy):

> [muks@gw1 ~]$ ntpq -np
>  remote refid  st t when poll reach   delay   offset   jitter
> =========================================================================================
> oNMEA(0)   .GPS. 0 l   15   16  377   0.   0.0002   0.0003

> [muks@gw1 ~]$ ntpq -c clocklist
> associd=0 status= no events, clk_unspec,
> name="NMEA",
> timecode="$GPRMC,002016,A,.,_,_.,_,000.0,315.7,180322,001.5,W*__",
> poll=103, noreply=0, badformat=0, baddata=0, fudgetime2=510.062, stratum=0, refid=GPS,
> flags=9, device="NMEA GPS Clock"
> [muks@gw1 ~]$

I've been running it this way for several years now, previously with the
ntp.org implementation of ntpd, and for a few months now with NTPsec.

The Garmin 18x LVC GPS receiver device stopped working yesterday due to
hardware failure, and I replaced it with another identical unit with the
same firmware version. Within a few hours of running ntpd with the new
device, the ntpd process terminated with the following syslog message:

> Mar 18 05:10:10 gw1 ntpd[2200]: CLOCK: Panic: offset too big: -604800.000
> Mar 18 05:10:10 gw1 systemd[1]: ntpd.service: Main process exited, code=exited, status=1/FAILURE
> Mar 18 05:10:10 gw1 systemd[1]: ntpd.service: Failed with result 'exit-code'.

It appears that the GPS receiver sent a faulty date in the $GPRMC NMEA
sentence. The following is a sequence of lines from
/var/log/ntpstats/clockstats:

> 59655 85114.640 NMEA(0) $GPRMC,233834,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__
> 59655 85130.640 NMEA(0) $GPRMC,233850,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__
> 59655 85146.640 NMEA(0) $GPRMC,233906,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__
> 59655 85162.640 NMEA(0) $GPRMC,233922,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__
> 59655 85178.640 NMEA(0) $GPRMC,233938,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__
> 59655 85194.640 NMEA(0) $GPRMC,233954,A,.,_,_.,_,000.0,315.7,100322,001.5,W*__
> 59655 85982.616 NMEA(0) $GPRMC,235302,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__
> 59655 85998.616 NMEA(0) $GPRMC,235318,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__
> 59655 86014.616 NMEA(0) $GPRMC,235334,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__
> 59655 86030.616 NMEA(0) $GPRMC,235350,A,.,_,_.,_,000.0,315.7,170322,001.5,W*__
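
For reading the two leading fields of each clockstats line, a small sketch that treats them as a Modified Julian Date and seconds past midnight UTC. The helper name is made up, and the sentence body is elided in the example because only the prefix is parsed.

```python
from datetime import datetime, timedelta, timezone

# MJD day 0 is 1858-11-17 00:00 UTC.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def clockstats_time(line):
    """Decode the 'MJD seconds' prefix of a clockstats line to UTC."""
    mjd, secs = line.split()[:2]
    return MJD_EPOCH + timedelta(days=int(mjd), seconds=float(secs))


# Prefix of the line that carries the odd date field.
t = clockstats_time("59655 85194.640 NMEA(0) $GPRMC,233954,...")
print(t.isoformat())
```

The system clock's own stamp for that line is 17 March 2022, 23:39:54 UTC, which agrees with the 233954 time field in the sentence; only the date field disagrees.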

Note the spurious "100322" date that is 1 week in the past. -604800 from
the syslog message is -1 week in seconds (7 * 24 * 3600). Note the ",A,"
is returned in the status (<2>) field in the $GPRMC sentence, by which
the GPS receiver still claims it has the "fix" and a valid position. If
you want a reference for the $GPRMC NMEA sentence for this receiver,
please see page 18 of:

https://static.garmin.com/pumac/GPS_18x_Tech_Specs.pdf
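
The arithmetic can be checked with a short sketch that parses the time and date fields of the suspect sentence and of the value expected from its neighbours. The helper name is made up; the field layouts (hhmmss and ddmmyy) are as in the sentences quoted above.

```python
from datetime import datetime, timezone


def rmc_datetime(hhmmss, ddmmyy):
    """Combine the $GPRMC time and date fields into a UTC datetime."""
    parsed = datetime.strptime(ddmmyy + hhmmss, "%d%m%y%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


expected = rmc_datetime("233954", "170322")  # date seen in neighbouring lines
reported = rmc_datetime("233954", "100322")  # date in the suspect sentence
delta = (reported - expected).total_seconds()
print(delta)  # -604800.0
```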

It appears that some bogus condition has occurred within the GPS
receiver and it has sent a spurious $GPRMC sentence. However, it seems
too extreme for ntpd to abort due to this. Could it ignore the sentence
with the big offset instead? The GPS receiver appears to correct itself
eventually. If ntpd aborts, the NTP service is no longer available,
which causes other problems.
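
To make the request concrete, here is a sketch of the kind of guard I have in mind. It is purely hypothetical and not how ntpd or its NMEA driver works; the class name, window size, and 1000 second limit are all invented for the example.

```python
from collections import deque
from statistics import median


class OutlierGuard:
    """Hypothetical filter: drop a sample whose offset is far from the
    median of recently accepted offsets."""

    def __init__(self, window=8, limit_s=1000.0):
        self.history = deque(maxlen=window)  # recently accepted offsets
        self.limit_s = limit_s

    def accept(self, offset_s):
        if self.history and abs(offset_s - median(self.history)) > self.limit_s:
            return False  # isolated wild sample: ignore it, keep history
        self.history.append(offset_s)
        return True


guard = OutlierGuard()
results = [guard.accept(x) for x in (0.0002, 0.0001, -604800.0, 0.0003)]
print(results)  # [True, True, False, True]
```

A guard this simple would also reject a genuine one-week error forever, so a real version would need some rule for when a persistent offset should be believed.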

I have never come across such an ntpd abort before. This is the first
time I'm seeing it.

Can this condition be handled in any other way, so that the service
doesn't terminate?

Mukund
