On 30 Dec 2012, at 21:34, Brian Dickson wrote:

> But, the basic problem is this: missing an UPDATE won't trigger either
> condition, it will at worst cause sub-optimal routing. Missing a WITHDRAW
> _CAN_ cause Bad Things (TM) to happen.
If we tear down the session based on a single bad UPDATE, then Bad Things(TM) happen. Worse still, these bad things happen to NLRI that are not associated with the erroneous UPDATE and that were supporting live customer services. With modern deployments of BGP, the sessions that are torn down can be relatively fundamental to network operation, carry multiple (unrelated) customer topologies, and have significant recovery times. The tear-down behaviour has therefore been shown to have harmful effects on the operation of real networks, based on real-world errors [0] and as per the draft that this thread is discussing. There are a number of deployments in which this is not acceptable.

We're spinning round the same loop here that has been discussed a number of times, so let me try and summarise it as accurately as I can:

As soon as we accept any kind of mechanism that targets error handling at particular NLRI, we KNOW that we will compromise correctness (as per Section 4.1 of the draft). We also KNOW that this has a cost: we can no longer be sure that the routing system is entirely consistent, and we might end up with stale NLRI, routing loops, or blackholing. The operational requirements described in this draft highlight that this _IS_ the cost of robustness, and highlight the need for operational monitoring and recovery mechanisms to support this relaxation. Further to this, they highlight ways that the current session-reset behaviour can be improved, which is particularly relevant in cases where we CANNOT apply more targeted error handling.

We are discussing whether we can wholly address the concern that "when there has been an error in an UPDATE, NOTHING is certain" - I would assert that we do not need to. The fundamental premise here is that we are compromising protocol correctness, which has a cost in terms of consistency. Operationally, paying this cost in order to keep numerous other services up and running is acceptable.
If this cost is not acceptable to an operator, then such mechanisms SHOULD NOT be deployed.

There is an argument that we could use a modified session-reset approach (NOTIFICATION after an erroneous UPDATE) for all errors. However, since we also have concerns around longer-lived errors (those sourced by a remote speaker, and hence simply repeated on session reset) and around the scalability of BGP routers, a key requirement is to optimise such that, where it is possible to do so, we make minimal compromises to correctness (i.e., we compromise the correctness of a limited subset of NLRI, where we can identify them) and maintain sessions.

If there are ways that we can improve the message packing in BGP such that we minimise the risk of compromising correctness with NLRI-targeted error handling, we should absolutely pursue these in draft-ietf-idr-error-handling. However, I think there is consensus that error handling as a problem space is something we need to address, since there are work items for this in both IDR and GROW.

Would the GROW working group be happy if we address Chris' concern related to "lost NLRI" (which AFAICS is really the case where we have more than one type of NLRI attribute within a single message) by adding a note that an error remains Non-Critical if _at least one_ NLRI attribute can be successfully parsed? I'm unclear here as to whether we're addressing a common situation where multiple sets of NLRI are contained within a single message [1]. If so, then adding "at least one", plus a further point that an implementation SHOULD use a single NLRI attribute per UPDATE message and place it at the start of the attributes, would seem to be a fair way forward.

Does this sound like a reasonable way forward? If so, I will update this in an -07, as well as addressing the editorial issues that Chris highlighted earlier.

Many thanks for your feedback. HNY!

r.
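For what it's worth, here is a minimal sketch of how the "at least one" rule could look in an implementation. The attribute type codes are the IANA-assigned values for MP_REACH_NLRI (14) and MP_UNREACH_NLRI (15); the function name and the sets-of-type-codes representation are purely illustrative, not taken from any real implementation:

```python
# Hypothetical sketch of the proposed Non-Critical classification rule:
# an UPDATE with a malformed attribute stays Non-Critical (targeted
# handling, session stays up) as long as at least one NLRI-bearing
# attribute parsed successfully, so the affected prefixes can still be
# identified. If every NLRI-bearing attribute failed to parse, we cannot
# scope the damage and the error is Critical (session reset).

MP_REACH_NLRI = 14
MP_UNREACH_NLRI = 15

def classify_update_error(parsed_attrs, malformed_attrs):
    """Return 'critical' or 'non-critical' for one erroneous UPDATE.

    parsed_attrs / malformed_attrs: sets of attribute type codes found
    in the message that did / did not parse successfully.
    """
    nlri_types = {MP_REACH_NLRI, MP_UNREACH_NLRI}
    # A malformed NLRI attribute, with no NLRI attribute that parsed:
    # we cannot tell which prefixes the UPDATE affects.
    if nlri_types & malformed_attrs and not (nlri_types & parsed_attrs):
        return "critical"
    # At least one NLRI attribute parsed: scope the error to those
    # prefixes (e.g. treat-as-withdraw) and keep the session.
    return "non-critical"
```

Under this sketch, a message where MP_UNREACH_NLRI is garbled but MP_REACH_NLRI parses would remain Non-Critical; the SHOULD of one NLRI attribute per UPDATE, placed first, then makes the Critical branch rare in practice.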
[0]: I enumerated a small number of real-world incidents back in 2011 when I presented this work at NANOG (slides: http://rob.sh/files/nanog-slides.pdf). There have been numerous incidents since, alongside many incidents that the operators involved were not willing to share more public details of. It would be a massive shame to me if, when the inevitable next incident like this turns up in the DFZ, or internally to an SP network, we are still questioning the validity of the problem space.

[1]: Trawling a number of PCAPs that I have, I can't find an UPDATE from a multi-vendor L3VPN network that contained MP_REACH_NLRI and MP_UNREACH_NLRI simultaneously. Another operator informs me that they have a certain implementation that appears to include both in a single message.
