Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Jared Mauch Wed, 02 Jan 2013 07:26:55 -0800

On Jan 1, 2013, at 12:27 PM, Chris Hall wrote:

> It is a truth universally acknowledged (AFAICS), that if NLRI in a
> broken UPDATE are treated-as-withdraw, that is no worse than
> session-reset and much to be preferred.
> 
> So, treat-as-withdraw is a reasonable thing for any implementation to
> do, by default, where it can.
> 
> The problem is that when things are broken, it may not be possible to
> identify all the NLRI -- some may be "lost".  [At this point I
> recommend: "The Engineer", AA Milne.]


I've been following this draft and wanted to chime in with some concerns.

1) In the (big) "Internet" space, lost NLRI are a huge deal for customers.

2) When we see a software defect that results in a session being closed, we 
regularly need help from the vendor to identify the offending NLRI/message.

3) The biggest problems we've seen are where vendor A and vendor B (or their 
sub-variants that run another OS) behave differently with the same NLRI.

I've seen a small number of these cases over the past ~12 years and am very 
concerned about the amount of effort going into error-correcting the error 
handling system when one side has a software defect.  I want to make it clear 
that all of these cases (attempting to resync at 0xffff, etc.) are correcting 
for a defect.  Some of these defects resulted in an improperly formatted UPDATE 
message from a sender, and others were the result of problems on the receiver.
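
To make "resync at 0xffff" concrete, here is a minimal sketch of what such a 
scan might look like.  It assumes only the RFC 4271 header layout (16 marker 
octets of all ones, a 2-octet length, a 1-octet type).  The function name and 
the plausibility checks are hypothetical, not taken from the draft or from any 
implementation.

```python
MARKER = b"\xff" * 16          # RFC 4271 header marker: 16 octets of all ones
HEADER_LEN = 19                # marker (16) + length (2) + type (1)
MAX_MSG_LEN = 4096             # RFC 4271 maximum message size
KNOWN_TYPES = {1, 2, 3, 4, 5}  # OPEN, UPDATE, NOTIFICATION, KEEPALIVE, ROUTE-REFRESH


def find_resync_offset(buf: bytes, start: int = 0) -> int:
    """Return the offset of the next plausible BGP header at or after start.

    Returns -1 if no candidate is found in the bytes buffered so far.
    This is a heuristic: the result is a guess, not a guarantee.
    """
    pos = buf.find(MARKER, start)
    while pos != -1:
        header = buf[pos:pos + HEADER_LEN]
        if len(header) < HEADER_LEN:
            return -1  # not enough bytes buffered yet to judge this candidate
        length = int.from_bytes(header[16:18], "big")
        msg_type = header[18]
        if HEADER_LEN <= length <= MAX_MSG_LEN and msg_type in KNOWN_TYPES:
            return pos
        # Implausible length or type: slide forward one octet and try again.
        pos = buf.find(MARKER, pos + 1)
    return -1
```

Note that attribute or NLRI octets which happen to look like a header would 
fool this scan, and anything skipped over before the returned offset is simply 
lost.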

The instability this could introduce into an enterprise or large-scale 
network is of significant concern.

To solve the "treat as withdraw" (aka: possibly create a per-node routing loop) 
problem, I would expect BGP users to demand a "periodic resend all NLRIs" 
feature to flush the state, creating further entropy in the system 
unnecessarily.

At some point, the equipment that is sending or receiving the BGP message will 
need to be maintained by the owner.  The dropping of the session is meant to 
draw attention to the problem, in the same way that others use ABORT or 
ASSERT in their code to handle an unexpected condition.
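
The contrast between the two behaviours can be sketched with a deliberately 
simplified model.  Everything here is an assumption for illustration: the RIB 
is a plain dict of routes learned from one peer, and Policy, SessionReset and 
handle_malformed_update are hypothetical names, not part of any real BGP 
implementation.

```python
from enum import Enum


class Policy(Enum):
    SESSION_RESET = "session-reset"
    TREAT_AS_WITHDRAW = "treat-as-withdraw"


class SessionReset(Exception):
    """Raised to tear the session down, much like a failed assertion."""


def handle_malformed_update(rib: dict, parsed_nlri: list, policy: Policy) -> list:
    """Apply a policy to a malformed UPDATE from one peer.

    rib         -- prefix -> attributes, routes learned from this peer only
    parsed_nlri -- the prefixes that could still be recovered from the message
    Returns the prefixes that were actually removed from the RIB.
    """
    if policy is Policy.SESSION_RESET:
        rib.clear()  # every route from this peer goes away, loudly
        raise SessionReset("malformed UPDATE, closing session")

    # Treat-as-withdraw: only the NLRI we managed to parse can be withdrawn.
    withdrawn = [prefix for prefix in parsed_nlri if prefix in rib]
    for prefix in withdrawn:
        del rib[prefix]
    return withdrawn


rib = {"192.0.2.0/24": "attrs-a", "198.51.100.0/24": "attrs-b"}
# Suppose the broken UPDATE also covered 198.51.100.0/24, but that NLRI could
# not be identified. It silently keeps its old, possibly stale, attributes.
removed = handle_malformed_update(rib, ["192.0.2.0/24"], Policy.TREAT_AS_WITHDRAW)
```

Under the reset policy the failure is visible to the operator straight away.  
Under treat-as-withdraw the session stays up, and any NLRI that could not be 
identified remains in the table without anyone being told.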

We cannot correct for every error, nor should we make that a goal, as the 
error handling code quickly gets complex.  (Speaking as someone who attempted 
to write a BGP implementation once... ugh.)

Asbestos suit on,

- Jared
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
