> On Jan 27, 2015, at 4:15 PM, Peter Vince <[email protected]> wrote:
>
> On 27 January 2015 at 23:03, Warner Losh <[email protected]> wrote:
> ...
> > A “cold spare” that’s sitting on the shelf powered off for more than 6
> > months. When the original fails, the hot spare is returned to service and
> > must wait an additional ~30 minutes to get the latest almanac before it
> > can recover UTC time from GPS time. This 30 minutes of down time puts the
> > system at < 4 9’s of reliability, and is an unacceptable delay. ...
>
> That assumes you can instantly get the cold unit off the shelf and plugged
> in. I suspect that actually doing that will, in most cases, take well over
> half an hour, and hence there goes your 4 9's, regardless of how quickly the
> unit can give an accurate output.

The time to detect a full failure of the system is measured in tens of seconds. The time to dispatch remote hands to the rack holding the system, with a stop at the cold spares room on the way, is measured in high single-digit minutes. The time to power down the chassis, swap out the failed system, and boot the new one is again measured in single-digit minutes. Time from failure to reboot can be less than 10 minutes in most cases (though not all). Having to wait the extra 30 minutes for the almanac thus quadrupled the outage time.

The system had a number of redundant elements, and the ability to signal when it wasn’t running in fully redundant mode, but the time spent in ‘not fully redundant’ mode was counted against the reliability goals. And since the redundant elements weren’t supposed to know about each other, asking another redundant element for leap second information wasn’t an option. We did find a way to store the data in multiple places, so that often (but not always) we could recover it and avoid the 30 minute penalty without forcing operator intervention.

Since this was a military installation as well, that added its own set of arbitrary complications, which were better worked around in software. And since the military was involved, much analysis was done on worst case scenarios rather than on typical scenarios, which would have made the problem simpler.

And yes, I’ve actually deployed systems like this, and argued for the simplifications and information sharing that would obviate the need for an almanac download. Clever as they were, those proposals didn’t survive a worst case scenario analysis. The whole situation was utterly stupid, but also utterly avoidable if you had 5 years of leap seconds available. To make matters worse, this was the primary timing system for a well known navigation system, so any outage was amplified well in excess of its actual effect, not to mention the requirements for data logging that were fundamentally at odds with fast restart.

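To put rough numbers on the above, here is a minimal sketch of the downtime arithmetic. The 365-day measurement window and the function names are my assumptions for illustration; the actual window and accounting rules used for the reliability goal are not stated here. The 10 minute swap time is the upper end of the "less than 10 minutes" figure, and the 30 minutes is the almanac wait.

```python
# Rough downtime-budget arithmetic for an "N nines" availability target.
# Assumption: availability is measured over a fixed window (365 days here);
# the real measurement window is not given in this thread.

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(nines: int, window_days: float) -> float:
    """Allowed downtime, in minutes, for `nines` nines over `window_days`."""
    unavailability = 10.0 ** -nines  # e.g. 4 nines -> 0.0001
    return window_days * MINUTES_PER_DAY * unavailability


def outages_within_budget(outage_minutes: float, nines: int,
                          window_days: float) -> int:
    """How many outages of this length fit inside the downtime budget."""
    return int(downtime_budget_minutes(nines, window_days) // outage_minutes)


SWAP_MINUTES = 10      # failure to reboot, per the figures above
ALMANAC_MINUTES = 30   # extra wait for the GPS almanac

outage_with_almanac = SWAP_MINUTES + ALMANAC_MINUTES
outage_ratio = outage_with_almanac / SWAP_MINUTES  # the "quadrupled" factor

budget = downtime_budget_minutes(4, 365)
print(f"4 nines budget over 365 days: {budget:.2f} minutes")
print(f"outage ratio with almanac wait: {outage_ratio:.0f}x")
print("swaps that fit without almanac wait:",
      outages_within_budget(SWAP_MINUTES, 4, 365))
print("swaps that fit with almanac wait:",
      outages_within_budget(outage_with_almanac, 4, 365))
```

Under these assumptions the yearly budget covers several 10 minute swaps but only a single 40 minute one, before counting any time spent in ‘not fully redundant’ mode. A shorter measurement window shrinks the budget proportionally.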
This is a real-life example of the law of unintended consequences and well-meaning good ideas...

Warner

_______________________________________________
LEAPSECS mailing list
[email protected]
https://pairlist6.pair.net/mailman/listinfo/leapsecs
