On Jan 3, 2013, at 2:35 PM, Michael Long wrote:

> On Jan 3, 2013, at 10:00 AM, Tony Li <[email protected]> wrote:
>>
>> All of the marketing that you're doing here is positioning this as a
>> 'solution'. It's not. Yes, it will stop the flap, but it does NOTHING to
>> fix or deal with the underlying bug. All it does is gloss it over, and as
>> such, it will have implications in the field whereby this papers over real
>> bugs and we have now promoted BGP errors into RIB errors. That's NOT making
>> things easier to debug, that's just applying a band-aid.
>
> I understand what you are saying and I agree 100%, however, from my
> operations perspective the "fix" is the same. Either upgrade to fixed code or
> policy out the offending announcement. I would rather deal with a customer
> routing issue vs a frantic call from our noc saying 15+ att peers globally
> are bouncing. The latter being a much bigger impact on our network.

I'm very concerned with the case of ignoring a route update and having a
month-long discussion about why some route is missing from the $carrier_a
network when it's being sent from $carrier_b and they show it going out just
fine. You don't know there's an issue until someone reports it and your
long-tail to problem resolution takes forever.

> I can live with a couple of /24's not working for a few customers. I can't
> have 15+ peers bouncing because of bad updates and even more peers bouncing
> because of missed keepalives due to cpu pegged trying to deal with 15 peers
> bouncing globally.

While related, this is an implementation defect on the part of vendors and
their poorly optimized TCP and BGP implementations being unable to get their
basic job done. I recall vendors blaming our "slow" system CPU, then finally
fixing their logic defect that always returned 1 or 0 when it thought it was
idle. (Sometimes those if statements look really complex.)

>> A more constructive way to address the real problem here would be to talk
>> about whether we should even re-establish the session after an error. Long
>> ago, we made an implementation decision to simply retry. That would seem to
>> be the real issue at hand.
>
> I would back this provided adequate logging as to why the session is down. It
> would be much like tripping max-prefixes where we could hard clear a
> single session for debug. I could live with this.

I certainly agree there needs to be better logging from the vendors.

I remain convinced that attempts to address this problem will create more
complex situations vs provide the desired result of a stable BGP core.

- Jared

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
