Re: Neoclock-4X driver removal

2019-08-11 Thread Achim Gratz via devel
Eric S. Raymond via devel writes:
> Achim Gratz via devel :
>> Eric S. Raymond via devel writes:
>> > * It has 2ms jitter, way worse than a cheap GPS these days.
>> 
>> That is actually much better than what most of the cheap GPS deliver
>> when connected over USB.
>
> You may be a bit behind the curve on this.

I love that start… when the cheapest shot is fired right at the
beginning, you just know that there isn't any real argument coming.

> I've measured 1ms jitter with the Macx-1, the device I designed in
> conjunction with Navisys back in 2012.  That was a bog-standard
> GPS+PL2303 design with 1PPS from the engine connected to the DCD 
> line on the PL2303.

True, but the "cheap GPS connected over USB" devices one can actually
buy mostly use USB serial converters that don't even pass DCD through
and hence have no PPS.  PPS over USB serial is still rare to find.

Specifically for Navilock (which I have experience with), their pucks
with PPS always have a separate PPS line (mostly TTL), and the modules
without PPS leave that extra circuitry unpopulated, making it even more
difficult to break out the PPS signal that the u-blox module inside
still provides (in case you're wondering, I have actually succeeded in
breaking it out).  The USB serial port usually comes either directly
from the u-blox module inside, which doesn't support DCD at all, or
from a converter chip that only deals with RX/TX.

For the benefit of other readers: The Macx-1 GR-601W seems no longer
obtainable, but the successor products GR-701W and GR-801W may be.  I've
instead switched to NavSpark mini modules plus an FTDI breakout board
that has the full set of serial signals.

> That's how I know that it already took very little effort to pull down
> that jitter figure seven years ago. Another way to put this is that as
> far back as 2012 you had to be screwing the pooch pretty determinedly
> to get as bad as 2ms.

Without PPS, that picture is not quite as rosy, with either a direct
serial or a USB serial connection.  Plain USB serial is still noticeably
worse in that case unless you choose very specific converter/driver
combinations.  Again, I mostly use FTDI where that becomes an issue, as
their drivers work consistently well across all OSes.  Prolific drivers,
on the other hand, are all over the map, and on Windows you have to be
really careful that the correct one is installed and actually talks to
the device.

> There's a realistic prospect of that jitter dropping to 0.25msec as
> people who make USB-to-serial chips stop bothering to support USB
> 1.1.

That's not the real reason and the numbers are wrong, too.  The reason
is that you need a high-speed endpoint to use microframes (which were
only specified with USB 2.0), and the number you are looking for is
125µs (there are eight microframes in a 1ms frame).  You need to provide
separate configurations for high-, full- and (if used at all) low-speed
endpoints anyway, so that the host can pick the one config it can or
wants to deal with; you can easily be USB 1.1 compatible and support
lower latency on the USB 2.0 endpoints at the same time this way.  It is
in fact recommended to provide alternative high-speed configurations
with longer poll intervals so that the host can pick one that fits
the overall load (interrupt transfers reserve bandwidth on the bus, so
not all wishes can be granted -- one of the reasons certain devices
don't work well across a USB hub).
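To make the numbers concrete, here is a small sketch of the polling arithmetic as the USB 2.0 spec defines it for interrupt endpoints (the function is illustrative, not from any driver):

```python
def interrupt_poll_period_us(b_interval: int, high_speed: bool) -> int:
    """Service interval for a USB interrupt IN endpoint, in microseconds.

    Full-speed: bInterval counts 1 ms frames (1..255).
    High-speed: the period is 2**(bInterval - 1) microframes of 125 us
    each (bInterval 1..16); eight microframes make up one 1 ms frame.
    """
    if high_speed:
        if not 1 <= b_interval <= 16:
            raise ValueError("high-speed bInterval must be 1..16")
        return 125 * 2 ** (b_interval - 1)
    if not 1 <= b_interval <= 255:
        raise ValueError("full-speed bInterval must be 1..255")
    return 1000 * b_interval

# A full-speed converter like the PL2303 can be polled at best once per frame:
assert interrupt_poll_period_us(1, high_speed=False) == 1000   # 1 ms
# A high-speed endpoint with bInterval=1 is polled every microframe:
assert interrupt_poll_period_us(1, high_speed=True) == 125     # 125 us
```

That is where the 125µs figure comes from: the latency floor is set by the endpoint's service interval, not by which spec version the device claims.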

Most of the popular USB serial converters don't support high-speed
endpoints, whether they claim compliance with USB 1.1 or 2.0 or yet
some other version (the PL2303 is one of those).  Some of the FTDI USB
serial interface processors (probably all that support JTAG, but I
haven't trawled through their whole product line) actually can be
configured to be polled on each microframe, but it seems that the Linux
driver still only uses the setting that has them buffer at 1ms (down
from 16ms standard).  In any case, with USB being a host-driven system,
it all comes down to what the driver does and whether it actually uses
the capabilities of the device.  That it would work in principle can
easily be inferred from the fact that a Raspberry Pi has lower than 1ms
jitter on the Ethernet port (which hangs off the USB 2.0 hub port, which
in turn ties into the USB 2.0 host interface).  That may not all be the
doing of the Ethernet driver, though; USB 2.0 hubs do a thing called
transaction translation that can yield surprising results with some
driver/device combinations.  Unfortunately I don't have a Pi 3 A, where
the host port is broken out directly to the single USB 2.0 port, on
which one could try to isolate these effects.

> This may already have happened - I haven't been tracking that
> area closely because 1ms jitter is just barely low enough not to be a
> real problem for an NTP source expected to deliver WAN
> synchronization.

So, what makes 2ms a number that lets you throw a driver out and 1ms a
number that lets you keep it?

Re: Driver strategy - we need to decide among incompatible goals

2019-08-11 Thread Achim Gratz via devel
Eric S. Raymond via devel writes:
> You've forgotten much, then. I remind you of the Type 2 Bancomm, the
> Type 45 Spectracom TSync PCI, and the Type 16 Bancomm GPS/IRIG
> Receiver.

Type 2 was actually the venerable Trak GPS, which was available in a
number of output configurations.  NTPd probably only supported PPS/RS232
and as such would not have needed any blobs.  Both the Spectracom and
Bancomm you cite (which I've never seen in real life) are add-in cards
(PCI in one case and VME in the other), so they would have needed a
device driver in the kernel and likely (based on some SMPTE hardware
from around the same timeframe that I did get my hands on briefly) also
some firmware that you'd need to download for them to actually start.
The Bancomm PCI cards (variously branded as Datum and Symmetricom) must
have been built by the shipload, as you can still buy them NOS; the
later models at least seem to have EEPROM, so there's no need to
download the firmware anymore.

In other words, while there may have been blobs there, none of them were
actually in NTPd.  I've started using xntpd on Sun hardware when the
method of distribution was still a QIC250 tape that was sent around via
the postal service, so I would very much have noticed if there were any
blobs besides the actual source code.

>> > I wrote about this bit of history because it's a precedent for
>> > narrowing our hardware support in order to improve our security
>> > and reduce our expected downstream defect rate.
>> 
>> Before you start to go down that road remind us again what threat model
>> you are trying to protect against.  Any talk about security is hollow
>> theater without that bit of information.
>
> Whatever your threat model is, reducing attack surface is effective
> security hardening.  Reducing total LOC and complexity in the codebase
> reduces attack surface.  Thus, reducing LOC anywhere you can do it 
> is a hardening strategy.

As a strategy it's fine, as a criterion for deciding what to let go it's
useless.  That is very much the point I was trying to make: your
proposed criterion doesn't actually tell us anything about which threat
you are going to mitigate.

> If you're only just now noticing that this is NTPsec development's
> central thrust, and has been since 2015, and that judging by CVEs in
> Classic that we've evaded it has been rather spectacularly successful,
> maybe you ought to be paying closer attention to what we're actually
> doing and achieving before you criticize.

How many of these were related to device support, obsolete or not?

>> > NTPsec aims to be highly secure and reliable.  If we're serious about
>> > that, we need to reduce our vulnerability to defects from these
>> > wraparound/rollover problems. 
>> 
>> You won't make even a tiny step in that direction based on your current
>> understanding of the issues.
>
> Please read https://docs.ntpsec.org/latest/rollover.html so you won't 
> be under any misapprehensions about what we understand.  You might
> also want to read the big comment at  
>
> https://gitlab.com/gpsd/gpsd/blob/master/timebase.c
>
> You can see from that how firm a grasp Gary Miller and I had on these
> problems before NTPsec.

Appeal to authority won't get you anywhere while you continue to skirt
the actual discussion.  But the first of the two citations is in fact a
lot more careful and nuanced in its claims than your broad-brushed
missive regarding device support.

> Yes, in the presence of era wraparounds perfect resolution of absolute time
> is not possible. We're not under any illusion that it is. What *is*
> achievable is to reduce the complexity of the failure cases and make the
> code better at self-auditing and notifying a human when it enters a bad 
> state.

Yet you haven't addressed the actual failure cases and how you plan to
mitigate them.

> Generally speaking, you can tell improvement of this kind is happening
> any time you rip out old shims.  The code that prevented autonomous
> operation from working at all before I fixed it in 2017 was, I believe,
> an old shim from the early days of the Y2K panic.

More anecdotal arguing.

>> > My thinking was that we would eventually drop all of the 2-digit-only
>> > modes and drivers, and say "if your refclock doesn't ship 4-digit
>> > years, it's disqualified".  Besides the autonomy issue, devices with
>> > this quirk are often very old hardware with wraparound problems.
>> 
>> So, all GPS receivers, to start with? 
>
> No, but it is conceivable that we might someday disqualify NMEA receivers
> that don't ship a ZDA sentence.

Based on what argument?

> Yes, of course the ZDA payload will be wrong after a wraparound. By
> removing the kludges that try to deduce a century from a two-digit
> year, though, we'd make the code to detect failure cases easier to
> reason about and be able to assert stronger invariants.

You've already cited an ntpsec document that (correctly) states that
this is just not going to happen as each GPS 
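For readers without the NMEA sentence formats at hand: ZDA is the sentence that carries a four-digit year, while RMC packs its date as ddmmyy, forcing exactly the century-deduction kludge under discussion.  A minimal illustrative parser (field layout per NMEA 0183; checksum verification omitted):

```python
def parse_zda(sentence: str) -> tuple[int, int, int]:
    """Extract (year, month, day) from a ZDA sentence.

    Layout: $GPZDA,hhmmss.ss,dd,mm,yyyy,local_zone_h,local_zone_m*CS
    """
    fields = sentence.split("*")[0].split(",")   # drop checksum, split fields
    if not fields[0].endswith("ZDA"):
        raise ValueError("not a ZDA sentence")
    return int(fields[4]), int(fields[3]), int(fields[2])

# ZDA hands you the century; RMC's 230394-style date field does not.
assert parse_zda("$GPZDA,201530.00,04,07,2002,00,00*60") == (2002, 7, 4)
```

Of course, even the four-digit year is only as trustworthy as the receiver's notion of its current GPS era.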

Re: Neoclock-4X driver removal

2019-08-11 Thread Eric S. Raymond via devel
Achim Gratz via devel :
> True, but the "cheap GPS connected over USB" devices one can actually
> buy mostly use USB serial converters that don't even pass DCD through
> and hence have no PPS.  PPS over USB serial is still rare to find.

Not any more.  You can buy them from Mark on Etsy for $50.

https://www.etsy.com/listing/501829632/navisys-gr-701w-u-blox-7-usb-pps

He's sold about 150 of them.

That happens to be my design inside that Navisys case.  The 1PPS gets
mapped to a USB priority packet that will arrive at the host on the
next poll. Maximum latency is the 1ms poll interval, average is half
that.  This is measured performance, not theoretical.

What's amusing about this is that it's literally a one-wire patch to
a bog-standard PL2303 reference design; I figured it out by looking
at spec sheets.
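This is also how the host side consumes that DCD pulse on Linux: block in an ioctl until the modem line changes, then timestamp the wakeup.  A rough sketch (Linux-only; the device path is a placeholder for wherever the PL2303 enumerates, and production consumers like gpsd use the kernel PPS API rather than a userspace timestamp):

```python
import fcntl
import os
import time

TIOCMIWAIT = 0x545C   # Linux ioctl: sleep until a modem status line changes
TIOCM_CD   = 0x040    # the DCD bit, which the GR-701W pulses on 1PPS

def wait_for_pps(fd: int) -> float:
    """Block until DCD changes, then timestamp the edge in userspace."""
    fcntl.ioctl(fd, TIOCMIWAIT, TIOCM_CD)
    return time.clock_gettime(time.CLOCK_REALTIME)

if __name__ == "__main__":
    path = "/dev/ttyUSB0"   # placeholder: wherever the converter shows up
    if os.path.exists(path):
        fd = os.open(path, os.O_RDONLY | os.O_NOCTTY)
        for _ in range(5):
            print("PPS edge at", wait_for_pps(fd))
        os.close(fd)
```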

> For the benefit of other readers: The Macx-1 GR-601W seems no longer
> obtainable, but the successor products GR-701W and GR-801W may be.  I've
> instead switched to NavSpark mini modules plus an FTDI breakout board
> that has the full set of serial signals.

Correct, the 601W (the original Macx-1) has been EOLed. Mark was
selling the GR-701W last I checked.  The 1PPS-to-DCD patch is simple
enough not to care what rev of the u-blox is on one end of it.  They
might have upgraded what they're shipping him to the u-blox 8 and we
would not necessarily have noticed.  All you have to do is look for the
W suffix.

Yes, most of the rest of the USB GPS world has not caught up with
this simple patch.  But there was one other back in 2012 when I
designed this, a USB GPS stick assembled in Brittany of all places.

> So, what makes 2ms a number that lets you throw a driver out and 1ms a
> number that lets you keep it?

That's not the predicate I'm using.

The Neoclock4x driver is now gone from my personal repository not
because of its accuracy limit alone but because of a combination of
factors.  The hardware has vanished off the face of the net - you
can't even find it on eBay, which will cheerfully sell not just 57 flavors
of obsolete GPSes but pre-modulation-change WWVB receivers that don't
even *work*.  The vendor has vanished too (this is what's news since
2015).  The driver has been unsupported since 2009, ten years now.

According to our project policy, a driver is eligible for removal when
it's been dead for seven years. The Neoclock4x gives every appearance
of having been dead for at least that long.  Reading between the
lines, it was a small-batch side project by a small software company; I
wouldn't be surprised if fewer than 100 of the things ever existed.

It *is* considered a strike against retention if a device has
performance significantly worse than the state of the art in cheap
GPSes (and in 2019 that means a Macx-1) but there's a specific
exception to that for radio clocks with holdover.  Thus, the
Neoclock4x's 2ms jitter isn't a deal-killer by itself, but it is...
marginal.  Not good enough to make an argument for keeping it in
support.

I haven't pushed the deletion yet.  I'm not in a hurry about this;
someone could pop up with news that the hardware is not dead.

I am going to be eying the other drivers that only ship 2-digit
years with a view to removing them as soon as other circumstances
justify it.  Spectracom Type 2, Arbiter, TrueTime, and OnCore
are tops on that hit list.

> >> Except that GPS still needs clear view of a relatively large portion of
> >> the sky and VLW doesn't, aside from all the interference and signal
> >> propagation issues that it has too, because it is operating just on a
> >> different band of RF.
> >
> > What you say is true in theory.
> 
> Well, it's true in practice as well.  This is a result of the physics of
> electromagnetic wave propagation and the constraints on where you can
> put the computers and an antenna for the receiver.  If you care to look
> what stratum-1 servers you get back from the NTP pool you'll see that
> certain colocation centers have only VLF and no GPS among their clients.

I'm sure that's true.  I'm also sure the share of GPS-only colos is
much larger.  Especially in the US, where most of the most densely populated
part of the country - the northeastern seaboard - is out of reliable
propagation range for WWVB.

> > In practice, experience in the U.S.
> > tells us pretty clearly that the tradeoff is in favor of GPS.
> >
> > How do we know this?  After the WWVB modulation change in 2012, all
> > the American clock-radio vendors moved to GPS-conditioned units *and
> > never looked back.*  Longwave receivers are no longer worth the NRE
> > to build them here.
> 
> That example is irrelevant to the discussion since the application
> (clock radio) has completely different operational and economical
> constraints from the one we're discussing (NTP stratum-1 refclock).

No, it's the same conversation. By "clock radio" I didn't mean the
cheap wall clocks, I meant the high-precision Stratum 1 radios that used
to exist in the U.S. but don't anymore.

Re: Driver strategy - we need to decide among incompatible goals

2019-08-11 Thread Eric S. Raymond via devel
Achim Gratz via devel :
> In other words, while there may have been blobs there, none of them were
> actually in NTPd. 

The fact that they had to be linked to the kernel rather than being in
userland made them *worse* security risks.  Those were the first
drivers I dropped.

> > Whatever your threat model is, reducing attack surface is effective
> > security hardening.  Reducing total LOC and complexity in the codebase
> > reduces attack surface.  Thus, reducing LOC anywhere you can do it 
> > is a hardening strategy.
> 
> As a strategy it's fine, as a criterion for deciding what to let go it's
> useless.  That is very much the point I was trying to make: your
> proposed criterion doesn't actually tell us anything about which threat
> you are going to mitigate.

Who said I was trying to use it to decide what specific things to remove?

The strategy implies that everything that can be removed should be.  It's
a reversal of Classic's policy of not throwing away code ever. But just
having that as a goal doesn't say what to get rid of.

To decide what should be removed in what order one has to apply other
criteria.  Like "Will this driver ever be used again?".  Removing all
the ones dependent on the old WWVB modulation, for example, was a
particularly easy decision.

> > If you're only just now noticing that this is NTPsec development's
> > central thrust, and has been since 2015, and that judging by CVEs in
> > Classic that we've evaded it has been rather spectacularly successful,
> > maybe you ought to be paying closer attention to what we're actually
> > doing and achieving before you criticize.
> 
> How many of these were related to device support, obsolete or not?

CVEs?  Not many: two, maybe three IIRC.  Autokey has a much worse defect 
history.
But when you're running a strategy centered on attack-surface reduction you
squeeze out code *everywhere you can* and that is exactly what we have done, for
a reduction in codebase size of 4:1.

> >> > NTPsec aims to be highly secure and reliable.  If we're serious about
> >> > that, we need to reduce our vulnerability to defects from these
> >> > wraparound/rollover problems. 
> >> 
> >> You won't make even a tiny step in that direction based on your current
> >> understanding of the issues.
> >
> > Please read https://docs.ntpsec.org/latest/rollover.html so you won't 
> > be under any misapprehensions about what we understand.  You might
> > also want to read the big comment at  
> >
> > https://gitlab.com/gpsd/gpsd/blob/master/timebase.c
> >
> > You can see from that how firm a grasp Gary Miller and I had on these
> > problems before NTPsec.
> 
> Appeal to authority won't get you anywhere while you continue to skirt
> the actual discussion.  But the first of the two citations is in fact a
> lot more careful and nuanced in its claims than your broad-brushed
> missive regarding device support.

Surprise! I wrote that entire "careful and nuanced" discussion myself.
You didn't know how much I know before you read it. You should at
least consider the possibility that four years of success hacking at
this giant hairball has taught me things *you* don't know.

> > Yes, in the presence of era wraparounds perfect resolution of absolute time
> > is not possible. We're not under any illusion that it is. What *is*
> > achievable is to reduce the complexity of the failure cases and make the
> > code better at self-auditing and notifying a human when it enters a bad 
> > state.
> 
> Yet you haven't addressed the actual failure cases and how you plan to
> mitigate them.

That's because I don't have a detailed plan.  I'm learning my
way into the problem, simplifying as I go.

I've already collected one major gain from this process.  Now you can
recover from a trashed system clock if you trust your clock sources
not to be lying to you about the year.  Of course they will sometimes
suffer era rollovers and tell a lie, but it's a hell of a lot easier
to detect that failure when you're looking at 4-digit years than at
two-digit year parts.
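To illustrate the difference in detectability (toy code, not NTPsec's actual logic; the pivot and build-year values here are arbitrary):

```python
def century_from_two_digit(yy: int, pivot: int = 69) -> int:
    """The classic kludge: map a 2-digit year onto a century via a pivot.
    Once real time crosses the pivot window it is silently wrong."""
    return 2000 + yy if yy < pivot else 1900 + yy

def plausible_four_digit(year: int, build_year: int = 2019) -> bool:
    """With four digits the failure is detectable: a receiver that rolled
    over a 1024-week GPS era reports a year before the code was built."""
    return year >= build_year

# An era rollover in 2019 throws the reported date ~19.6 years into the past:
assert plausible_four_digit(1999) is False   # caught
# The 2-digit kludge happily maps the same bogus timestamp to a "valid" year:
assert century_from_two_digit(99) == 1999    # failure invisible
```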

That last is an example of what I mean by simplifying the range of
failure modes.  My plan is to just keep chiseling away the rock until
I can wrap my head around all the failure interactions and produce
something like a proof of behavior.

Even if I never get to that point, every one of the simplification
steps required to go in that direction pushes the code in the
direction of better maintainability and auditability.

> > Generally speaking, you can tell improvement of this kind is happening
> > any time you rip out old shims.  The code that prevented autonomous
> > operation from working at all before I fixed it in 2017 was, I believe,
> > an old shim from the early days of the Y2K panic.
> 
> More anecdotal arguing.

What you call "anecdotal arguing" is what you get when you're working 
a heuristic with a telos rather than a plan where you can spec the form
of the final solution.

That's all I have.  Because some problems are so messy