My best parsing of that ticket, with some guesses:

- An Infinera management card goes Really Bad, knocks out its local waves, and starts spewing garbage onto the management network.
- The management network propagates the garbage. Other Infinera management cards receive it and fall into the same state, knocking down their local waves and re-spewing the garbage.
- Backup tunnels, put in place so that management network connectivity works all the time, help propagate the garbage.
- They start getting into some devices via OOB, probably rebooting them. The devices come up OK, then the garbage traffic knocks them over again.
- They start pulling down the backup tunnels to stop the virus from spreading, bouncing stuff again, and putting filters on each device to drop the garbage traffic.
- This starts to work, but then they hit other problems with linecards in devices that were bounced.
- They also start hitting sites they don't have functional OOB for, and have to send someone out to get access manually.
On Sun, Dec 30, 2018 at 8:45 AM Saku Ytti wrote:
> Apologies for the URL, I do not know official source and I do not
> share the URL's sentiment.
> https://fuckingcenturylink.com/
>
> Can someone translate this to IP engineer? What did actually happen?
> From my own history, I rarely recognise the problem I fixed from
> reading the public RCA. I hope CenturyLink will do better.
>
> Best guess so far that I've heard is
>
> a) CenturyLink runs global L2 DCN/OOB
> b) there was HW fault which caused L2 loop (perhaps HW dropped BPDU,
> I've had this failure mode)
> c) DCN had direct access to control-plane, and L2 congested
> control-plane resources causing it to deprovision waves
>
> Now of course this is entirely speculation, but intended to show what
> type of explanation is acceptable and can be used to fix things.
> Hopefully CenturyLink does come out with IP-engineering readable
> explanation, so that we may use it as leverage to support work in our
> own domains to remove such risks.
>
> a) do not run L2 DCN/OOB
> b) do not connect MGMT ETH (it is unprotected access to control-plane,
> it cannot be protected by CoPP/lo0 filter/LPTS etc.)
> c) do add in your RFP scoring item for proper OOB port (Like Cisco CMP)
> d) do fail optical network up
>
> --
> ++ytti
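Guess b) above hinges on Ethernet frames carrying no TTL: once a dropped BPDU lets a loop form, nothing removes broadcast copies from the network. A minimal Python sketch, assuming a flat L2 DCN with no spanning tree and switches that flood a broadcast out of every port except the one it arrived on. The `flood` function and the tree/ring/mesh topologies are hypothetical illustrations, not CenturyLink's actual network.

```python
from collections import Counter
from itertools import combinations


def flood(links, origin, steps):
    """Return how many copies of one broadcast frame are in flight at each step.

    Hypothetical model: a flat L2 domain with no spanning tree, where every
    switch floods a broadcast out of every port except the one it arrived on.
    Ethernet has no TTL, so nothing in this model ever expires a copy.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # A host behind `origin` sends one broadcast; origin floods it to all neighbors.
    # Keys are (sending switch, receiving switch); values are copies on that link.
    in_flight = Counter((origin, n) for n in neighbors[origin])
    history = []
    for _ in range(steps):
        history.append(sum(in_flight.values()))
        nxt = Counter()
        for (src, dst), copies in in_flight.items():
            for n in neighbors[dst]:
                if n != src:  # flood everywhere except back out the ingress port
                    nxt[(dst, n)] += copies
        in_flight = nxt
    return history


if __name__ == "__main__":
    tree = [("A", "B"), ("B", "C")]                # loop-free, as STP would leave it
    ring = [("A", "B"), ("B", "C"), ("C", "A")]    # one loop, e.g. a BPDU got dropped
    mesh = list(combinations("ABCD", 2))           # many redundant paths, no STP
    print("tree:", flood(tree, "A", 6))  # tree: [1, 1, 0, 0, 0, 0]
    print("ring:", flood(ring, "A", 6))  # ring: [2, 2, 2, 2, 2, 2]
    print("mesh:", flood(mesh, "A", 6))  # mesh: [3, 6, 12, 24, 48, 96]
```

In the tree the copy dies at the leaf. In the single ring two copies circulate forever, and in this model every later broadcast adds more that never leave. With more redundant links the count doubles each hop. Under guess c), each of those copies would also land on any management interface attached to the segment, which is how a loop in the DCN could turn into control-plane exhaustion.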

