Jeff,

On Jan 3, 2013, at 11:51 AM, Jeff Wheeler wrote:
> On Wed, Jan 2, 2013 at 12:56 PM, Tony Li <[email protected]> wrote:
>> While we can do SOME things to decrease session resets, we cannot fix all
>> cases and simply treating things as a withdraw and walking away is wholly
>> unacceptable, as some of you will hopefully agree. Creating arbitrary hair
>> here is NOT going to help as the error handling code itself will become
>> fraught with errors.
>
> Tony,
>
> Every operator I've asked thinks "ignore bad BGP messages," which is
> even more extreme than treat-as-withdraw, is a good idea.

Not sure who you're asking, but I think this whole draft is an
idealist attempt to work around software defects that may be
uncorrectable as a whole. (I say this as a $large_operator as
measured here:
http://as-rank.caida.org/?mode0=as-ranking&n=10&ranksort=1 )

The missing prefix because it was ignored, or routing loop because a
withdraw was ignored, will quickly change these folks' minds.

Take one of the most recent defects:

http://www.cisco.com/en/US/products/csa/cisco-sa-20100827-bgp.html

The device takes a valid route on the receive side and corrupts it as
it forwards it. While ignoring may be one solution, there is no way
to actually know about, or get remediation of, this prefix and
software defect.

There are 3 classes of BGP operators:

1) Core networks
2) Edge/Mid-tier networks
3) People using it as their IGP/datacenter/vpn/private networks
   (these may eventually connect to the internet, but those UPDATE
   messages won't necessarily reach it)

> I'm not saying anyone thinks this would be a good default. These
> things can just be knobs used when they are appropriate or necessary.
>

Methinks you underestimate the complexity that would be added to the
error handling code. While finding a marker/0xff may be easier,
understanding the large block of updates in the flood of activity and
low latency of large TCP windows makes this much harder and more
prone to error.

> Respectfully, you are about as wrong as one could get on this issue.
> Of course customers don't want one prefix to be broken. Fifteen years
> of "CEF problem" have taught us all that these conditions are hard to
> troubleshoot. However, it is often preferable to have one or many
> prefixes broken, than have a BGP session flap endlessly due to some
> bug.

For some operators, the only chance to work around defects is to have
something catastrophic happen to provide the justification to
management to actually pick up the $new_software that corrects the
problems you've been applying band-aids to.

There are many people who now are having trouble diagnosing network
problems due to a workaround from early 2009:

http://puck.nether.net/pipermail/cisco-nsp/2009-February/058512.html

> The vendor should have some standards body coverage for giving the
> operator this knob, and customers are right to ask for it. Sure,
> you're giving us more rope. Sometimes that is what we need.

I'm generally in the more-rope camp, but this is effectively throwing
out years of well-worn code paths and introducing new code (and likely
defects) in the handling of error cases, which can only make things
worse.

There wasn't a broad-reaching BGP attribute problem in 2012 that I'm
aware of, putting us in the "there is stability in the core" camp.
I'm seeing a well-intentioned but unneeded element of meddling here.
People on the edge also need to learn to maintain their devices. This
isn't a standards body issue, and IMHO off-topic for here, but an
important datapoint.

- Jared
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
