> On Jan 27, 2015, at 4:15 PM, Peter Vince <[email protected]> wrote:
> 
> 
> On 27 January 2015 at 23:03, Warner Losh <[email protected]> wrote:
> ...
> > A “cold spare” that’s sitting on the shelf powered off for more than 6
> > months. When the original fails, the cold spare is returned to service
> > and must wait an additional ~30 minutes to get the latest almanac before
> > it can recover UTC time from GPS time. This 30 minutes of down time puts
> > the system at < 4 9’s of reliability, and is an unacceptable delay. ...
> 
> That assumes you can instantly get the cold unit off the shelf and plugged 
> in.  I suspect that actually doing that will, in most cases, take well over 
> half an hour, and hence there goes your 4 9's, regardless of how quickly the 
> unit can give an accurate output.

The time to detect a full failure of the system is measured in tens of seconds.
The time to dispatch remote hands to the rack holding the system, with a trip
by the cold spares room on the way, is measured in high single-digit minutes.
The time to power down the chassis, swap out the failed system, and boot the
new system is again measured in single-digit minutes. Time from failure to
reboot can be less than 10 minutes in most cases (though not all). Having to
wait the extra 30 minutes for the almanac thus quadrupled the outage time.

The system had a number of redundant elements, and the ability to signal when
it wasn’t running in fully redundant mode, but the time spent ‘not fully
redundant’ was counted against the reliability goals. And since the redundant
elements weren’t supposed to know about each other, asking the redundant
element for leap information wasn’t an option, though we did find a way to
store the data in multiple places so that often (but not always) we could
recover it and not pay the 30 minute penalty, without forcing operator
intervention.

Since this was a military installation as well, that added its own set of
arbitrary complications which were better worked around in software. And since
the military was involved, much analysis was done on worst case scenarios,
rather than typical scenarios which would have made the problem simpler.
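
The outage arithmetic can be sketched as below. The 10 minute and 30 minute
figures are the ones given in this message; the measurement windows are an
assumption, since nothing here says over what period the reliability goal was
evaluated, and the function names are only illustrative.

```python
MINUTES_PER_DAY = 24 * 60

SWAP_MINUTES = 10      # failure to reboot, figure from this message
ALMANAC_MINUTES = 30   # extra wait for the almanac, figure from this message
FOUR_NINES = 0.9999


def availability(downtime_minutes, window_days):
    """Fraction of the window during which the system was up."""
    return 1.0 - downtime_minutes / (window_days * MINUTES_PER_DAY)


def downtime_budget_minutes(target, window_days):
    """Minutes of downtime the target availability allows in the window."""
    return (1.0 - target) * window_days * MINUTES_PER_DAY


# How much longer the outage is once the almanac wait is added.
outage_ratio = (SWAP_MINUTES + ALMANAC_MINUTES) / SWAP_MINUTES

if __name__ == "__main__":
    print(f"outage grows by a factor of {outage_ratio:.0f}")
    for window_days in (90, 180, 365):  # assumed windows, not from the message
        budget = downtime_budget_minutes(FOUR_NINES, window_days)
        swap_only = availability(SWAP_MINUTES, window_days)
        with_wait = availability(SWAP_MINUTES + ALMANAC_MINUTES, window_days)
        print(f"{window_days:3d} days: budget {budget:6.2f} min, "
              f"swap only {swap_only:.6f}, with almanac wait {with_wait:.6f}")
```

Under these assumptions a single failure stays within a four nines budget only
if the window is long enough: the 10 minute swap fits in a 180 day window, while
the 40 minute outage does not.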

And yes, I’ve actually deployed systems like this, and argued for the
simplifications and information sharing that would obviate the need for an
almanac download. While clever and all that, they didn’t survive a worst case
scenario analysis. It was utterly stupid, but also utterly avoidable if you had
5 years of leap seconds available.
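
As a rough illustration of that last point, here is a minimal sketch of
recovering UTC from GPS time with a stored leap second table instead of
waiting for the almanac. It assumes leap seconds are announced far enough
ahead that the table is still valid when the spare comes off the shelf. The
table layout and function names are hypothetical and are not taken from the
deployed system; the offsets are the published GPS minus UTC values.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# (UTC instant at which the offset takes effect, GPS minus UTC in seconds).
# Hypothetical stored table; a real one would extend as far ahead as
# leap seconds have been announced.
LEAP_TABLE = [
    (datetime(2009, 1, 1), 15),
    (datetime(2012, 7, 1), 16),
    (datetime(2015, 7, 1), 17),
]

# The same boundaries expressed on the GPS timescale, for direct comparison
# with a GPS timestamp.
GPS_THRESHOLDS = [effective + timedelta(seconds=offset)
                  for effective, offset in LEAP_TABLE]


def gps_minus_utc(gps_time):
    """Look up the GPS minus UTC offset in effect at a GPS timestamp."""
    index = bisect_right(GPS_THRESHOLDS, gps_time) - 1
    if index < 0:
        raise ValueError("GPS time precedes the stored leap second table")
    return LEAP_TABLE[index][1]


def gps_to_utc(gps_time):
    """Convert a GPS timestamp to UTC using only the stored table.

    datetime cannot represent 23:59:60, so a timestamp that falls inside
    the leap second itself maps to 00:00:00 of the following day.
    """
    return gps_time - timedelta(seconds=gps_minus_utc(gps_time))


if __name__ == "__main__":
    print(gps_to_utc(datetime(2015, 1, 27, 12, 0, 16)))
```

The table is only as good as its last entry: a receiver that has been off
longer than the announced horizon would still need fresh leap information
from somewhere.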

To make matters worse, this was the primary timing system for a well known
navigation system, so any outage was amplified well in excess of its actual
effect, not to mention requirements for data logging that were fundamentally
at odds with fast restart.

This is a real-life example of the law of unintended consequences and
well-meaning good ideas...

Warner
_______________________________________________
LEAPSECS mailing list
[email protected]
https://pairlist6.pair.net/mailman/listinfo/leapsecs
