I figured that labs-l may be interested in my initial incident report, since the incident caused a partial outage for labs. Anything starting with "cr" is a core router. I have removed vendor names from this report.
Root cause: a flapping link, followed by light being received but no packets passing, on the cr1-sdtpa to cr2-eqiad (preferred) link caused OSPF and BGP to partially go down between the datacenters. This caused an outage for production for those coming in via Tampa and an outage for labs for those coming in via Eqiad.

Timeline:
17:52 link between cr1-sdtpa and cr2-eqiad starts flapping
18:04 first Icinga reports of downtime
18:05 link stops flapping, but no traffic will pass through it
18:10 Leslie: changed the OSPF metric on the cr2-eqiad to cr1-sdtpa link to try to make the traffic route via the hopefully working link
18:10 services start reporting back online; the labs outage for traffic coming in via Eqiad is alleviated, and the outage for most transit coming in via Tampa is alleviated
18:18 folks coming in via $MAJOR_TRANSIT_PROVIDER and via $OTHER_PROVIDER transit can finally reach the site again (possibly route dampening from the flapping? An unrelated outage? Some sort of physical cut in the area?)

I have called $VENDOR and have ticket X. They say they are fine between Tampa and Orlando, and are calling $OTHERVENDOR about the wave now. The link is currently passing traffic, but is still down-preferred via an OSPF metric of 40,000.

Semi-related but important note: Leslie set log-updown for all BGP peers in order to facilitate better investigation of future outages.

--
Leslie Carr
Wikimedia Foundation
AS 14907, 43821
http://as14907.peeringdb.com/
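For readers less familiar with the two mitigations mentioned above (the OSPF metric change and the log-updown setting), here is a rough configuration sketch. It is illustrative only: the report intentionally omits vendor names, the interface name below is hypothetical, and the exact syntax will differ by platform. The idea is that raising the OSPF cost on the bad link to 40,000 makes every other path cheaper, so traffic routes around it without the link being administratively shut down.

```
/* Illustrative sketch only -- vendor and interface names are assumptions. */
protocols {
    ospf {
        area 0.0.0.0 {
            interface xe-0/0/0.0 {    /* hypothetical: link toward cr1-sdtpa */
                metric 40000;         /* very high cost: down-prefer this path */
            }
        }
    }
    bgp {
        log-updown;                   /* log every peer session up/down event */
    }
}
```

With log-updown enabled, each BGP session state transition is written to the router's log, which makes it much easier to reconstruct a timeline like the one above during a post-incident investigation.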
_______________________________________________
Labs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/labs-l
