Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Jeff Wheeler Thu, 03 Jan 2013 16:27:05 -0800

On Thu, Jan 3, 2013 at 5:31 PM, Tony Li <[email protected]> wrote:
> A syntactic error would be any inconsistency of length fields, 
> incompatibility between a type and a length, or any other error that would 
> introduce any doubt about the parsing of the message.  Once this type of 
> error has occurred, then the remainder of the data stream is wholly in doubt. 
>  A session reset seems like the only (conservative) way of handling this, as 
> it's unclear that further data would be accurate.
>
> I'm very open to a discussion of alternatives to session flap for these 
> cases.  Should we require manual intervention before session restart?  This 
> would seem reasonable.

I've already posted an alternative.  The BGP Marker field makes it
possible to recover message framing after a syntactic error.  I'm not
sure I think this is a good idea.  I am sure I would wish for it at 2am
on a Saturday morning, if it fixed my problem.  So if my vendor supplied
it as an option, I might be glad for that someday.
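
To make the Marker idea concrete, here is a rough sketch of what resynchronizing on the header could look like.  It assumes the RFC 4271 header layout (16 octets of all ones, a 2-octet length, a 1-octet type); the function name resync and the plausibility checks are my own illustration, not anything from the draft or from any vendor's code.  It is only a heuristic, since sixteen 0xFF octets can also appear inside a message body.

```python
MARKER = b"\xff" * 16            # RFC 4271: 16-octet Marker, all ones
HEADER_LEN = 19                  # Marker (16) + Length (2) + Type (1)
MAX_MSG_LEN = 4096               # RFC 4271 maximum message size
KNOWN_TYPES = {1, 2, 3, 4, 5}    # OPEN, UPDATE, NOTIFICATION, KEEPALIVE, ROUTE-REFRESH


def resync(stream: bytes, start: int = 0) -> int:
    """Return the offset of the next plausible BGP header at or after
    `start`, or -1 if none is found in the buffered data.

    Hypothetical helper: a candidate is accepted only if the Marker is
    followed by a sane Length and a known Type.
    """
    pos = stream.find(MARKER, start)
    while pos != -1:
        header = stream[pos:pos + HEADER_LEN]
        if len(header) < HEADER_LEN:
            return -1            # not enough buffered data to judge
        length = int.from_bytes(header[16:18], "big")
        msg_type = header[18]
        if HEADER_LEN <= length <= MAX_MSG_LEN and msg_type in KNOWN_TYPES:
            return pos
        pos = stream.find(MARKER, pos + 1)   # slide forward and retry
    return -1
```

A receiver that lost framing could discard everything before the returned offset and resume parsing there, at the cost of silently dropping whatever the skipped octets were meant to say.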

> Semantic errors would be consistency violations within the contents of the 
> message.  In these cases, treat-as-withdraw seems reasonable.

The current draft for "treat-as-withdraw" has a great deal of
complexity.  You seem to both favor this solution, and hate the
complexity.  I also hate the complexity, and "ignore-bad-message" is
my alternative.

Would you rather the draft simply eliminate a lot of the complexity?
I would, because the complexity is what creates more opportunities for
session-reset.  It also may catalyze bugs.

None of these are "nerd knobs" when they fix your operational problem
and allow you to stop bleeding money.  TCP Path MTU Discovery for BGP
is a "nerd knob."  Yet, there it is, along with plenty of other things
that might tweak your router a bit; but they're not there to save your
butt in a crisis.  That's what error-handling should do -- save your
butt.

On Thu, Jan 3, 2013 at 5:54 PM, Smith, Donald
<[email protected]> wrote:
> Other than getting it noticed I doubt a reset will ever make a misbehaving 
> router stop misbehaving.

Maybe not, but I've certainly seen it solve problems.  For example,
Cisco 6500/7600s used to send Ethernet frames larger than the MTU to
eBGP neighbors if TCP MD5 for BGP was in use.  I had an eBGP session to
a transit network which would fail every few days because the Cisco box
would repeatedly try to send an illegally large frame, my router would
discard it, and the Cisco would keep retransmitting it.  Eventually my
HoldTime would expire.  As you might imagine, this took us quite some
time to diagnose; but at least the auto-reset returned the network to a
good state pretty quickly.

> This is what happened recently. ALU announced attributes with 1 bit set that 
> others didn't understand. If the routers that didn't understand that bit 
> could have ignored just that element everything would have been fine.

Actually the problematic routers that reset their sessions "MUST have
ignored [this bit] when received," according to RFC4271, but they
didn't do that.  So the reason why those routers broke is that THEY
were buggy AND the originating router was buggy.  It is also arguable
that routers in the DFZ should not have propagated this 1 bit.  Some
routers didn't while others did.
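
For illustration, here is roughly what "ignored when received" means for a parser.  This assumes the bit in question was one of the four low-order bits of the Attribute Flags octet, which RFC 4271 leaves unused (zero when sent, ignored when received).  The function name and the returned dict are hypothetical, just a sketch of the masking.

```python
# Attribute Flags octet, high-order bits as defined in RFC 4271
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20
EXTENDED_LENGTH = 0x10
DEFINED_MASK = 0xF0   # the low-order four bits are unused


def parse_attr_flags(flags_octet: int) -> dict:
    """Decode the flags octet, masking off the unused low-order bits
    instead of treating them as an error (hypothetical helper)."""
    defined = flags_octet & DEFINED_MASK
    return {
        "optional": bool(defined & OPTIONAL),
        "transitive": bool(defined & TRANSITIVE),
        "partial": bool(defined & PARTIAL),
        "extended_length": bool(defined & EXTENDED_LENGTH),
    }
```

With this masking, an optional transitive attribute received as 0xC1 decodes exactly like 0xC0, and the stray bit never reaches the error-handling path at all.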

So if the receiving routers weren't buggy in the first place, they
would not have experienced any problem.  However, ignore-bad-message
or treat-as-withdraw would be a good catch-all that an operator could
use, even if his router was buggy, to keep his network functioning.

On Thu, Jan 3, 2013 at 6:14 PM, Tony Li <[email protected]> wrote:
> The entire problem with the ignore option is that it leaves bogus information 
> in the network.  Suppose that the update contains an AS path change for a 
> prefix.  If you ignore the update, then you ignore that path change, AND, you 
> fail to propagate that change to your upstream neighbors.  Now, BGP's loop 
> prevention mechanism is out the window and, in the worst case, you've created 
> an inter-domain forwarding loop.
>
> If, on the other hand, you withdraw that prefix, then any alternate 
> connectivity for that prefix can come into play.
>
> In short, treat-as-withdraw is a fail safe approach.  Ignoring updates is not.

Treat-as-withdraw is NOT fail-safe.  It can create loops or blackholes
in your network.  Claiming over and over that it is safe will not make
it true.

Ignore-bad-message has drawbacks that are just as serious as those of
treat-as-withdraw, but it is less complicated to implement and has a
better chance of avoiding a potentially crippling session-reset loop.
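
A toy model may help compare the three behaviors being argued about.  Everything here is hypothetical: the RIB is a plain dict of prefix to AS path for a single peer, the prefixes and AS numbers are documentation values, and the sketch assumes the prefixes in the bad UPDATE could still be parsed, which is itself one of the hard parts of treat-as-withdraw.

```python
from enum import Enum


class Policy(Enum):
    SESSION_RESET = "session-reset"
    TREAT_AS_WITHDRAW = "treat-as-withdraw"
    IGNORE_BAD_MESSAGE = "ignore-bad-message"


def handle_bad_update(rib: dict, prefixes: list, policy: Policy) -> dict:
    """Return the routes still held from one peer after a malformed
    UPDATE covering `prefixes` (toy model, not a real implementation)."""
    if policy is Policy.SESSION_RESET:
        return {}                       # every route from the peer is lost
    if policy is Policy.TREAT_AS_WITHDRAW:
        # drop only the affected prefixes; alternate paths may take over
        return {p: path for p, path in rib.items() if p not in prefixes}
    return dict(rib)                    # ignore: old, possibly stale, state stays


rib = {"192.0.2.0/24": [64500, 64501], "198.51.100.0/24": [64500]}
bad = ["192.0.2.0/24"]
after_withdraw = handle_bad_update(rib, bad, Policy.TREAT_AS_WITHDRAW)
after_ignore = handle_bad_update(rib, bad, Policy.IGNORE_BAD_MESSAGE)
after_reset = handle_bad_update(rib, bad, Policy.SESSION_RESET)
```

The model only shows what each policy leaves in the table.  It says nothing about whether the surviving state is correct, which is exactly the point in dispute.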

-- 
Jeff S Wheeler <[email protected]>
Sr Network Operator  /  Innovative Network Concepts
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
