Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Chris Hall Thu, 03 Jan 2013 18:36:34 -0800

Jared Mauch wrote (on Thu 03-Jan-2013 at 17:40 +0000):
....
> Not sure who you're asking, but I think this whole draft is an
> idealist attempt to workaround software defects that may be
> uncorrectable as a whole.  (I say this as a $large_operator as
> measured here:
> http://as-rank.caida.org/?mode0=as-ranking&n=10&ranksort=1 )
> 
> The missing prefix because it was ignored, or routing loop because a
> withdraw was ignored will quickly change these folks' minds.


I agree with you: if a software defect breaks routeing, then no amount
of extra software can put it together again.

I also agree with you: the issue of "lost NLRI" has to be addressed.
I do not know whether some routeing issues are *always* worse than
wholesale loss of routeing, or wholesale bouncing up and down of
routes.  But some folk would like to be given the choice.

I think that some extra facilities to mitigate the effects of software
defects can be devised, and each operator can then decide which ones
are appropriate to the circumstances, from time to time.

> Take one of the most recent defects:
> 
> http://www.cisco.com/en/US/products/csa/cisco-sa-20100827-bgp.html
> 
> The device takes a valid route on the receive side and corrupts it
> as it forwards it.  While ignoring may be one solution, there is no
> way to actually know or get remediation of this prefix and software
> defect.

That is an interesting case.

I understand that the bug was: Attribute Type 99 was received with a
length of 3000 bytes; that attribute was sent out as if it were 184
bytes, except that the Length Field in the outgoing attribute still
said 3000.  This is a fine example of a broken attribute hiding any
and all attributes which follow it.  Note that the Message Length,
Withdrawn Routes Length and Total Path Attribute Length were
consistent with each other and with what was actually sent -- so the
overall UPDATE Message was "correctly framed".
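
To make the "hiding" effect concrete, here is a minimal Python sketch
of a receiver walking the path attributes of an UPDATE.  It assumes
the usual attribute layout (flags, type, then a one- or two-byte
length selected by the Extended Length flag).  The helper names
encode_attr and walk_attrs, and the choice of neighbouring attributes
(ORIGIN before, COMMUNITIES after), are mine and purely illustrative;
this is not taken from any real implementation.

```python
import struct

OPTIONAL, TRANSITIVE, EXT_LEN = 0x80, 0x40, 0x10  # attribute flag bits

def encode_attr(flags, type_code, value, declared_len=None):
    """Encode one path attribute; declared_len lets us lie about the length."""
    length = len(value) if declared_len is None else declared_len
    fmt = "!BBH" if flags & EXT_LEN else "!BBB"
    return struct.pack(fmt, flags, type_code, length) + value

def walk_attrs(blob):
    """Walk a path-attributes blob; return (types_parsed, error_or_None)."""
    seen, pos = [], 0
    while pos < len(blob):
        header_len = 4 if blob[pos] & EXT_LEN else 3
        if pos + header_len > len(blob):
            return seen, "truncated attribute header"
        type_code = blob[pos + 1]
        if header_len == 4:
            (length,) = struct.unpack_from("!H", blob, pos + 2)
        else:
            length = blob[pos + 2]
        pos += header_len
        remaining = len(blob) - pos
        if length > remaining:
            # The declared length runs past the end of the attributes, so
            # everything after this header is swallowed and cannot be trusted.
            return seen, (f"attribute {type_code} claims {length} bytes, "
                          f"only {remaining} remain")
        seen.append(type_code)
        pos += length
    return seen, None

origin = encode_attr(TRANSITIVE, 1, b"\x00")                  # ORIGIN
communities = encode_attr(OPTIONAL | TRANSITIVE, 8, bytes(4))  # COMMUNITIES
flags99 = OPTIONAL | TRANSITIVE | EXT_LEN

good = origin + encode_attr(flags99, 99, bytes(3000)) + communities
# As described above: 184 bytes of value, but the Length Field says 3000.
broken = origin + encode_attr(flags99, 99, bytes(184), declared_len=3000) + communities

print(walk_attrs(good))    # all three attributes are found
print(walk_attrs(broken))  # only ORIGIN is found, then an error
```

In the broken case the receiver cannot tell where Attribute 99 ends,
so the COMMUNITIES attribute after it is lost as well, even though the
outer lengths of the message are perfectly consistent.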

What appears to have happened here is that some previously unused and
under-tested code for handling unknown, optional, transitive
attributes did something foolish.  3000 is 0xBB8 and 184 is 0xB8.  So,
the left hand knew that the attribute length was two bytes, while the
right hand only used the less significant byte.  If the programmer had
got things consistently wrong, and used the less significant byte of
the length throughout, there would have been far less excitement!
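
The arithmetic is easy to check.  The following Python sketch is only
my reconstruction of the behaviour described above, not the vendor's
code: the function names and the flags value are assumptions made for
illustration.  It contrasts the "left hand / right hand" mismatch with
the consistently wrong variant.

```python
import struct

FLAGS = 0xD0  # Optional | Transitive | Extended Length (illustrative)

def forward_buggy(type_code, value):
    """Length Field keeps both bytes; the copy uses only the low byte."""
    declared = len(value)
    copied = value[: declared & 0xFF]
    return struct.pack("!BBH", FLAGS, type_code, declared) + copied

def forward_consistently_wrong(type_code, value):
    """Low byte used for both the Length Field and the copy."""
    short = len(value) & 0xFF
    return struct.pack("!BBH", FLAGS, type_code, short) + value[:short]

received = bytes(3000)                  # 3000 == 0xBB8
buggy = forward_buggy(99, received)
wrong = forward_consistently_wrong(99, received)

for name, out in (("buggy", buggy), ("consistently wrong", wrong)):
    (declared,) = struct.unpack_from("!H", out, 2)
    print(f"{name}: declares {declared} bytes, carries {len(out) - 4}")
```

The consistently wrong variant silently truncates the attribute value,
which is a defect of its own, but the result is still a well-formed
attribute that downstream parsers can step over.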

One can draw some comfort from the fact that the packing of the
overall message -- code which is exercised micro-second in,
micro-second out -- got things right.

A very small minority of routes carried the Attribute 99.  So,
crashing the session threw away many entirely innocent routes.  And,
of course, the bug didn't go away, so sessions cycled up and down.

In this particular case the source of Attribute 99 shut itself down
within 30 minutes -- because that was the planned duration for the
experiment, *not* because it had been identified as the source of some
calamity.  So, perhaps unusually, the source of the problem simply
went away.  I wonder whether future latent bugs will be swept up as
quickly?

Suppose such a latent bug strikes a particularly common BGP
implementation.  Suppose I am a little ISP and both my Transit
Providers' border routers do something equally unfortunate.  And
suppose either:

  a) sessions bounce up and down until... until...

or:

  b) my shiny new BGP implementation recovers from this
     Message-Level error by treating-as-withdraw the
     affected prefix(es)...

...vote now :-)
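
For anyone who wants the two options side by side, here is a toy
Python model.  Everything in it is an assumption for illustration: the
function names, the table of routes (documentation prefixes only), and
the simplification that the affected prefixes can still be read from
the broken UPDATE.

```python
def session_reset(routes_from_peer, bad_prefixes):
    """Option (a): a malformed UPDATE drops the session, and with it
    every route learned from that peer."""
    return {} if bad_prefixes else dict(routes_from_peer)

def treat_as_withdraw(routes_from_peer, bad_prefixes):
    """Option (b): only the prefixes named in the malformed UPDATE are
    withdrawn; every other route from the peer stays in place."""
    return {prefix: attrs for prefix, attrs in routes_from_peer.items()
            if prefix not in bad_prefixes}

routes = {
    "192.0.2.0/24": "innocent",
    "198.51.100.0/24": "innocent",
    "203.0.113.0/24": "carries Attribute 99",
}
bad = {"203.0.113.0/24"}

print(len(session_reset(routes, bad)))      # routes left after (a)
print(len(treat_as_withdraw(routes, bad)))  # routes left after (b)
```

In this model (a) also repeats for as long as the bad UPDATE keeps
arriving, whereas (b) costs the affected prefix(es) once.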

Chris

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
