Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Jared Mauch Thu, 03 Jan 2013 09:40:36 -0800

Jeff,

On Jan 3, 2013, at 11:51 AM, Jeff Wheeler wrote:

> On Wed, Jan 2, 2013 at 12:56 PM, Tony Li <[email protected]> wrote:
>> While we can do SOME things to decrease session resets, we cannot fix all 
>> cases and simply treating things as a withdraw and walking away is wholly 
>> unacceptable, as some of you will hopefully agree.  Creating arbitrary hair 
>> here is NOT going to help as the error handling code itself will become 
>> fraught with errors.
> 
> Tony,
> 
> Every operator I've asked thinks "ignore bad BGP messages," which is
> even more extreme than treat-as-withdraw, is a good idea.

Not sure who you're asking, but I think this whole draft is an idealistic 
attempt to work around software defects that may be uncorrectable as a whole.  
(I say this as a $large_operator as measured here: 
http://as-rank.caida.org/?mode0=as-ranking&n=10&ranksort=1 )

A prefix that goes missing because its update was ignored, or a routing loop 
caused by an ignored withdraw, will quickly change these folks' minds.

Take one of the most recent defects:

http://www.cisco.com/en/US/products/csa/cisco-sa-20100827-bgp.html

The device takes a valid route on the receive side and corrupts it as it 
forwards it.  While ignoring the update may be one solution, it leaves no way 
to actually know about the affected prefix or to get the software defect 
remediated.

There are 3 classes of BGP operators:

1) Core networks
2) Edge/Mid-tier networks
3) People using it as their IGP/datacenter/VPN/private networks (these may 
eventually connect to the Internet, but those UPDATE messages won't necessarily 
reach it)

> I'm not saying anyone thinks this would be a good default.  These
> things can just be knobs used when they are appropriate or necessary.
> 

Methinks you underestimate the complexity that would be added to the error 
handling code.  While finding a marker/0xff may be easier, making sense of the 
large block of updates that arrives in a flood of activity, given the low 
latency and large TCP windows involved, is much harder and more prone to error.
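
To illustrate the marker-hunting part of that point, here is a minimal sketch 
in Python of resynchronizing on the 16-byte all-ones marker of the BGP message 
header (RFC 4271).  The function name find_next_header and the set of accepted 
type codes are assumptions made for illustration; this is not taken from any 
vendor's implementation.

```python
import struct

MARKER = b"\xff" * 16        # RFC 4271: 16-octet marker, all ones
HEADER_LEN = 19              # marker (16) + length (2) + type (1)
MAX_MSG_LEN = 4096           # RFC 4271 maximum message size
KNOWN_TYPES = {1, 2, 3, 4, 5}  # OPEN, UPDATE, NOTIFICATION, KEEPALIVE, ROUTE-REFRESH


def find_next_header(buf, start=0):
    """Return the offset of the next plausible BGP header, or -1.

    Hypothetical helper: "plausible" only means the marker is followed by
    a legal length and a known type code. It cannot prove the bytes are a
    real header rather than payload that happens to look like one.
    """
    pos = buf.find(MARKER, start)
    while pos != -1:
        if pos + HEADER_LEN > len(buf):
            return -1  # not enough bytes buffered to judge this candidate
        length, msg_type = struct.unpack_from("!HB", buf, pos + 16)
        if HEADER_LEN <= length <= MAX_MSG_LEN and msg_type in KNOWN_TYPES:
            return pos
        # Candidate rejected: slide forward one byte and keep looking.
        pos = buf.find(MARKER, pos + 1)
    return -1
```

Even this small scan has to guess: a run of 0xff bytes inside a corrupted 
message can pass the same checks, and nothing here says what state the RIB 
should be in for the bytes that were skipped.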

> Respectfully, you are about as wrong as one could get on this issue.
> Of course customers don't want one prefix to be broken.  Fifteen years
> of "CEF problem" have taught us all that these conditions are hard to
> troubleshoot.  However, it is often preferable to have one or many
> prefixes broken, than have a BGP session flap endlessly due to some
> bug.

For some operators, the only chance to get defects fixed is for something 
catastrophic to happen, providing the justification management needs to 
actually pick up the $new_software that corrects the problems you've been 
applying band-aids to.

Many people are now having trouble diagnosing network problems because of a 
workaround from early 2009:

http://puck.nether.net/pipermail/cisco-nsp/2009-February/058512.html

> The vendor should have some standards body coverage for giving the
> operator this knob, and customers are right to ask for it.  Sure,
> you're giving us more rope.  Sometimes that is what we need.

I'm generally in the more-rope camp, but this is effectively throwing out years 
of well-worn code paths and introducing new code (and likely defects) in the 
handling of error cases, which can only make things worse.  There wasn't a 
broad-reaching BGP attribute problem in 2012 that I'm aware of, putting us in 
the "there is stability in the core" camp.  I'm seeing a well-intentioned but 
unneeded element of meddling here.

People on the edge also need to learn to maintain their devices.  This isn't a 
standards-body issue, and IMHO it is off-topic here, but it is an important 
data point.

- Jared

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
