On 30 Dec 2012, at 21:34, Brian Dickson wrote:

> But, the basic problem is this: missing an UPDATE won't trigger either 
> condition, it will at worst cause sub-optimal routing. Missing a WITHDRAW 
> _CAN_ cause Bad Things (TM) to happen.

If we tear down the session based on a single bad UPDATE, then Bad Things™ 
happen. Worse still, these bad things happen to NLRI that are not associated 
with the erroneous UPDATE and that are supporting live customer services. 
With modern deployments of BGP, the sessions that are torn down can be 
relatively fundamental to network operation, carry multiple (unrelated) 
customer topologies, and have significant recovery times. Based on this, the 
tear-down behaviour has been shown to harm the operation of real networks 
through real-world errors [0], and as per the draft that this thread is 
discussing. There are a number of deployments in which this is not acceptable.

We're spinning round the same loop here that has been discussed a number of 
times. Let me try to summarise this as accurately as I can:
 
As soon as we accept any kind of mechanism that targets error handling to 
particular NLRI, we KNOW that we will compromise correctness (as per Section 
4.1 of the draft). We also KNOW that this has a cost - which is that we cannot 
be sure that the routing system is entirely consistent any more, and we might 
end up with stale NLRI, routing loops, or blackholing. The operational 
requirements described in this draft highlight that this _IS_ the cost of 
robustness, and highlight the need for operational monitoring and recovery 
mechanisms to support this relaxation. Further, they highlight ways in which 
the current session-reset behaviour can be improved, which is particularly 
relevant in cases where we CANNOT apply more targeted error handling.
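To make the trade-off concrete, here is a minimal sketch (not any vendor's implementation - the Session/Update classes and helper names are invented for illustration) contrasting a full session reset with NLRI-targeted "treat-as-withdraw" handling:

```python
# Hypothetical sketch: session reset vs. NLRI-targeted error handling.
# All class and attribute names are invented for illustration only.

class Update:
    def __init__(self, nlri, attrs_ok=True):
        self.nlri = nlri          # prefixes carried in this message
        self.attrs_ok = attrs_ok  # stand-in for "path attributes parse cleanly"

class Session:
    def __init__(self):
        self.rib = {}    # prefix -> attributes learned over this session
        self.up = True

    def reset(self):
        # Session reset: every NLRI learned over this session is lost,
        # whether or not it was related to the erroneous UPDATE.
        self.rib.clear()
        self.up = False

def handle_update(session, update, targeted=True):
    if update.attrs_ok:
        for p in update.nlri:
            session.rib[p] = "attrs"
        return
    if targeted:
        # Treat-as-withdraw: compromise correctness only for the NLRI
        # in this message; the session, and unrelated customer routes,
        # survive.  Operational monitoring must pick up the discrepancy.
        for p in update.nlri:
            session.rib.pop(p, None)
    else:
        session.reset()

s = Session()
handle_update(s, Update(["10.0.0.0/8", "192.0.2.0/24"]))
handle_update(s, Update(["192.0.2.0/24"], attrs_ok=False), targeted=True)
print(s.up, sorted(s.rib))  # session stays up; only the bad message's NLRI gone
```

The point of the sketch is the asymmetry: the targeted path loses one message's worth of NLRI, while the reset path loses everything the session carried.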

We are discussing whether we can wholly address the concern that "when there 
has been an error in an UPDATE, NOTHING is certain" - I would assert that we 
do not need to. The fundamental premise here is that we are compromising 
protocol correctness, which has a cost in terms of consistency. Operationally, 
paying this cost in order to keep numerous other services up and running is 
acceptable. If this cost is not acceptable to an operator, then such 
mechanisms SHOULD NOT be deployed.

There is an argument that we could use a modified session-reset approach 
(NOTIFICATION after an erroneous UPDATE) for all errors. However, we also 
have concerns around longer-lived errors (those sourced by a remote speaker, 
and hence simply repeated on session reset) and around the scalability of BGP 
routers. A key requirement is therefore that, where it is possible to do so, 
we make minimal compromises to correctness (i.e., we compromise the 
correctness of a limited subset of NLRI, where we can identify them) and 
maintain sessions.

If there are ways that we can improve the message packing in BGP such that we 
minimise the risk of compromising correctness with NLRI-targeted error 
handling, absolutely we should try and pursue these in 
draft-ietf-idr-error-handling - however, I think that there is consensus that 
error handling as a problem space is something we need to address, since there 
are work items for this in both IDR and GROW.

Would the GROW working group be happy if we address Chris' concern related to 
"lost NLRI" (which AFAICS is really the case where we have >1 type of NLRI 
attribute within a single message) by adding a note that an error remains 
Non-Critical if _at least one_ NLRI attribute can be successfully parsed? I'm 
unclear here as to whether we're addressing a common situation where multiple 
sets of NLRI are contained within a single message [1]. If so, then adding 
"at least one", plus a further point that an implementation SHOULD use a 
single NLRI attribute per UPDATE message and place it at the start of the 
attributes, would seem to be a fair way forward.
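As a sketch of how the proposed "at least one" rule would classify a message (the type codes are the real IANA values for MP_REACH_NLRI/MP_UNREACH_NLRI, but the per-attribute parse results here are simply supplied by the caller rather than decoded from the wire):

```python
# Hedged sketch of the proposed rule: an UPDATE stays Non-Critical if at
# least one NLRI-bearing attribute parses; it is Critical only when every
# NLRI-bearing attribute is unreadable (i.e. NLRI would truly be "lost").

MP_REACH_NLRI = 14    # IANA path attribute type codes
MP_UNREACH_NLRI = 15
NLRI_ATTRS = {MP_REACH_NLRI, MP_UNREACH_NLRI}

def classify(parsed_attrs):
    """parsed_attrs: list of (type_code, parsed_ok) pairs for one UPDATE."""
    nlri = [(t, ok) for t, ok in parsed_attrs if t in NLRI_ATTRS]
    if not nlri:
        return "non-critical"   # no NLRI-bearing attributes at risk
    if any(ok for _, ok in nlri):
        return "non-critical"   # at least one NLRI attribute parsed
    return "critical"           # every NLRI attribute failed to parse

print(classify([(MP_REACH_NLRI, True), (MP_UNREACH_NLRI, False)]))  # non-critical
print(classify([(MP_UNREACH_NLRI, False)]))                          # critical
```

Placing a single NLRI attribute at the start of the attributes, as suggested above, would make the "at least one parsed" check almost always succeed before any later malformed attribute is reached.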

Does this sound like a reasonable way forward? If so, I will update this in an 
-07, as well as addressing the editorial issues that Chris highlighted earlier.

Many thanks for your feedback.

HNY!
r.

[0]: I enumerated a small number of real-world incidents back in 2011 when I 
presented this work at NANOG (slides: http://rob.sh/files/nanog-slides.pdf). 
There have been numerous incidents since, alongside many for which the 
operators involved were not willing to share further public details. It would 
be a massive shame to me if, when the inevitable next incident like this turns 
up in the DFZ, or internally to an SP network, we are still questioning the 
validity of the problem space.

[1]: Trawling a number of PCAPs that I have, I can't find an UPDATE that contained 
MP_REACH_NLRI and MP_UNREACH_NLRI simultaneously from a multi-vendor L3VPN 
network. Another operator informs me that they have a certain implementation 
that appears to include both in a single message.


_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow