Treat-as-withdraw is not a cure. Tear down the session is not a cure. All we can hope for after a malformed update is a temporary mitigation until human intervention can fix the problem.
IMO, the goal of error handling is to limit the damage, not to cure the problem.

-- Jakob Heitz.

________________________________
From: [email protected] [mailto:[email protected]] On Behalf Of Brian Dickson
Sent: Friday, December 28, 2012 11:02 AM
To: Chris Hall
Cc: [email protected]; [email protected]
Subject: Re: [GROW] [Idr] Fwd: I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Here's a quick question, about possible mechanisms for determining whether or not we need to tear down a session.

If there is a chance that some NLRI weren't properly decoded:
- What about requesting (presuming the option was negotiated) a route-refresh, or a "confirm per-AFI-SAFI prefix list"?

It's an expensive "parity check" but is one way of ensuring that we haven't missed a withdrawn prefix. If it isn't in the subsequently received list of prefixes, we missed it and should withdraw it.

Everything else is a no-op - maybe we missed a new prefix (with new attribute?), but that is what "treat as withdraw" gets you.

Just trying to make sure the cure isn't worse than the disease, in all situations.
(Where "cure" is fail to withdraw because we "lost" the withdrawal because of malformed update, or "cure" is tear down the session.)

Brian

On Fri, Dec 28, 2012 at 1:04 PM, Chris Hall <[email protected]<mailto:[email protected]>> wrote:

Rob Shakir wrote (on Fri 28-Dec-2012 at 13:10 +0000):
>
> (re: CCing IDR & GROW)
>
> On 28 Dec 2012, at 12:28, Chris Hall wrote:
>
> > Rob Shakir wrote (on Thu 27-Dec-2012 at 18:44):
> >>
> >> Any comments very welcome (to me or grow@).

> > I'm afraid I still don't get it :-(  What am I missing ?
> >
> > UPDATE Message Length errors are Critical because they:
> >
> >   (1) "result in cases whereby the NLRI attribute cannot
> >        be correctly extracted".
> >
> > The implication is that a failure to extract all NLRI is Critical.
> > Is that a requirement ?

> If the NLRI cannot be determined, then this is a Critical error,
> yes.
> I left the wording relatively open on whether this is *all*
> NLRI, ...

OK.  So, if the message is broken (in any way):

  * if no NLRI can be found, that's a Critical Error.

    That would include the case where some broken attribute upstream of any MP_XXX obscures that MP_XXX.

  * if some NLRI can be found, that's a Non-Critical error.

    Such NLRI as can be found are treated-as-withdraw.

    AND any NLRI that may or may not have been lost, are ignored.

    AND whatever dangers lost NLRI may pose are covered by the caveats in 4.1.

Yes ?

> ... as I am not sure that in the requirements draft we should
> specify direct solutions to specific issues, to e.g., say how to
> handle cases where MP_REACH_NLRI and MP_UNREACH_NLRI are in the same
> message [this is a case that I do not believe is forbidden by
> rfc2858 - if the working group could clarify whether this is
> something that we feel the draft needs to handle or can explicitly
> be omitted, then that would be appreciated].

The RFCs also allow any mix of IPv4 Unicast Reachable/Unreachable in the body of the message, with or without MP_REACH_NLRI and/or MP_UNREACH_NLRI. If we call each of these a "collection of NLRI", then there may be between 1 and 4 collections of NLRI in a message (ignoring the End-of-RIB pseudo-UPDATE).

It is an error to have more than one MP_REACH_NLRI attribute or more than one MP_UNREACH_NLRI attribute. It is not clear to me whether those should be Critical Errors -- but while we are relaxing all error checking, I don't see why they should be.

It is possible that most implementations actually only send one collection per message... so, as you suggest, on a you-know-and-I-know basis, we all could ignore the "theoretical" issues of more than one collection in a message. But, this does not matter if the requirements allow lost NLRI to simply be ignored.

If the sender sends any MP_XXX first, then there is little chance that NLRI will be lost, and things work better.
If the sender follows the old rules, and particularly if it sends more than one collection at a time, then there is more chance that NLRI will be lost.

> > Later:
> >
> >   (2) "All errors whereby the contained NLRI can be
> >        extracted are referred to as Non-Critical".
> >
> > And that includes:
> >
> >   (3) "where the length of all path attributes contained
> >        within the UPDATE does not correspond to the
> >        total path attribute length."
> >
> > That is, at least, more explicit than
> > draft-ietf-idr-error-handling-03, which glosses over (3).
> >
> > But if (3) is non-critical then there is some chance that
> > some NLRI will not be extracted, which appears to violate
> > (1) and (2).

> Disclaimer: As I am sure that my comments previously have made
> clear, I do not maintain a code base for a BGP daemon/implementation
> - so please feel free to correct my logic below.
>
> I do not believe that (3) implies that the NLRI cannot be correctly
> found.

What (3) implies to me is that somewhere amongst the attributes something is broken.

As Jakob Heitz succinctly puts it:

  "Once you have a malformed update, NOTHING is certain."

In particular, you cannot be certain that all NLRI referred to by the sender can be found by the receiver.

> If the sum of total length is incorrect, then we can still
> extract the individual attributes - we just find that there is not
> enough data to fill the overall length we were told and/or we have
> too much attribute data compared to the total attribute length.

This is precisely where I think the requirements come unstuck.

There is almost no redundancy in the attribute encoding. And almost all the redundancy there is is discarded by the new-form-error-handling.

So, if the sum of the attribute lengths is incorrect, then (inter alia) the receiver CANNOT know whether the attributes it has extracted are the attributes the sender intended to send.
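
A minimal Python sketch may help illustrate the point about length mismatches. It walks a Path Attributes field using the header layout from RFC 4271 (flags octet, type octet, then a one- or two-octet length selected by the Extended Length flag 0x10). The function name walk_path_attributes, the sample attribute bytes, and the choice of which single octet to corrupt are assumptions made for illustration; none of them comes from either draft.

```python
import struct

EXTENDED_LENGTH = 0x10  # attribute flag bit: length field is two octets


def walk_path_attributes(buf: bytes):
    """Walk a Path Attributes field.

    Returns (attrs, consumed, clean), where attrs is a list of
    (type_code, value) pairs, consumed is the number of octets parsed
    before stopping, and clean is True only if the walk ended exactly
    at the end of the field.
    """
    attrs = []
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < 3:            # not even a short header left
            return attrs, pos, False
        flags, type_code = buf[pos], buf[pos + 1]
        if flags & EXTENDED_LENGTH:
            if len(buf) - pos < 4:
                return attrs, pos, False
            (length,) = struct.unpack_from("!H", buf, pos + 2)
            header = 4
        else:
            length = buf[pos + 2]
            header = 3
        end = pos + header + length
        if end > len(buf):                # attribute overruns the field
            return attrs, pos, False
        attrs.append((type_code, buf[pos + header:end]))
        pos = end
    return attrs, pos, True


# Hypothetical UPDATE attributes: ORIGIN, AS_PATH, NEXT_HOP, MP_UNREACH_NLRI.
ORIGIN = bytes([0x40, 1, 1, 0])
AS_PATH = bytes([0x40, 2, 4, 2, 1, 0xFD, 0xE9])
NEXT_HOP = bytes([0x40, 3, 4, 192, 0, 2, 1])
MP_UNREACH = bytes([0x80, 15, 8, 0, 2, 1, 32, 0x20, 0x01, 0x0D, 0xB8])
good = ORIGIN + AS_PATH + NEXT_HOP + MP_UNREACH

# Corrupt one octet: the AS_PATH length becomes 5 instead of 4.
damaged = bytearray(good)
damaged[6] = 5
bad = bytes(damaged)

good_attrs, good_used, good_clean = walk_path_attributes(good)
bad_attrs, bad_used, bad_clean = walk_path_attributes(bad)
print([t for t, _ in good_attrs], good_clean)  # [1, 2, 3, 15] True
print([t for t, _ in bad_attrs], bad_clean)    # [1, 2] False
```

In this toy case a single wrong length octet desynchronises the walk, so the receiver stops before it ever sees MP_UNREACH_NLRI (type 15) and has no way to tell that a withdrawal was present.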
In particular, the receiver cannot know it has extracted all NLRI (except in the, probably obscure, case of having extracted both MP_REACH_NLRI and MP_UNREACH_NLRI).

  "Once you have a malformed update, NOTHING is certain."

Conversely, if the sum of attribute lengths is correct, then there is a fighting chance that the attributes received are the attributes sent. But, if one of those attributes is (say) nominally an ATOMIC_AGGREGATE which is apparently 200 octets long, that might be a worry.

...

> > Then (4) "In order to maximise the number of cases whereby the
> > NLRI attributes [plural, now, BTW] can be reliably extracted
> > from a received message...".  Ah.  So it is not a Critical
> > Error if "the NLRI attribute cannot be correctly extracted".

> No - it is a Critical error if we cannot extract the NLRI. This
> recommendation is to give an increased chance that the NLRI can be
> extracted as per the IDR error handling draft. This then (by virtue
> of resulting in the NLRI being extracted) minimises the number of
> cases that result in a Critical error.

By this do you mean "all" NLRI or only "some" NLRI ?  As above.

> The plural here is to reflect
> that the existence of >1 type of NLRI attribute.

That would imply that it should be "NLRI attributes" throughout ?

Though in most places where the draft speaks of "NLRI attribute" I think it means one, some, any or all of the "collections of NLRI" (as defined above).

> > For me the requirement remains "conflicted".  On the one hand it
> > seems to say that it is a Critical Error if the NLRI cannot be
> > extracted and parsed.  On the other it seems to say it's OK if
> > you cannot extract some NLRI.

> If you'll forgive me for removing a significant proportion of your
> message, I think that we need to take another step back here. It
> seems to me that the key question that you are highlighting is "What
> level of confidence do we need to have before we declare that the
> NLRI cannot be extracted?" -- do you agree?
Absolutely.

> From an operator perspective, I would like to compromise *certainty*
> for *robustness*. You are right, we are compromising correctness
> here, we might end up withdrawing an incorrect NLRI and impacting
> service operation for that prefix - however, it is somewhat
> preferable to me to withdraw a subset of the NLRI incorrectly,
> rather than impact all NLRI in one single action. We clearly need to
> provide some bounds on how much we compromise the certainty (and
> live within the realms of possibility, such that we are not just
> taking a shot in the dark). This is what the definitions of Critical
> and Non-Critical within the document are intended to provide. Once
> again I will refer to the requirement that there is a balance
> between correctness and robustness - rather than a locally risk
> averse approach that results in harmful wider behaviour.

Where one can extract the NLRI from a broken message, then treat-as-withdraw is AFAICS no worse for the NLRI in question than session-reset, and hugely better for all the other NLRI received from the peer.

Where not *all* the NLRI in a message have been extracted, then we take a step beyond withdrawing something which should not have been withdrawn. For each of those NLRI the receiver will (unknowingly) do one of:

  (a) continue using a route which the sender has withdrawn;

  (b) continue to use an out of date version of a route which the sender has changed in some way, possibly materially;

  (c) fail to use a new route the sender has now made available.

Of these (a) looks serious and (b) could be, but certainly the effect is different from session-reset; while (c) is, essentially, treat-as-withdraw.

If some risk of "lost routes" is taken, then the impact of that clearly must be weighed against the impact of session-reset. I have not found a discussion of the impact of "lost routes" in the draft, so I have no idea what the trade-off is, here.
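
The three outcomes (a) to (c) can be made concrete with a small Python sketch. It assumes a toy model in which the receiver's view of the peer's routes is a dict from prefix to attributes, and a lost NLRI is described by what the sender intended (None standing for a withdrawal). The names LostRouteEffect and effect_of_lost_nlri, and the extra NO_EFFECT case, are hypothetical and appear in neither draft.

```python
from enum import Enum


class LostRouteEffect(Enum):
    STALE_WITHDRAWN = "a"  # keeps using a route the sender has withdrawn
    STALE_CHANGED = "b"    # keeps using an out-of-date version of a route
    MISSED_NEW = "c"       # fails to use a newly available route
    NO_EFFECT = "-"        # assumption: nothing held, nothing changed


def effect_of_lost_nlri(rib_in, prefix, intended):
    """Classify what losing one NLRI does to the receiver.

    rib_in   -- receiver's current routes from this peer (prefix -> attrs)
    prefix   -- the NLRI that could not be extracted
    intended -- attrs the sender meant to announce, or None for a withdrawal
    """
    held = rib_in.get(prefix)
    if intended is None:
        if held is not None:
            return LostRouteEffect.STALE_WITHDRAWN
        return LostRouteEffect.NO_EFFECT
    if held is None:
        return LostRouteEffect.MISSED_NEW
    if held != intended:
        return LostRouteEffect.STALE_CHANGED
    return LostRouteEffect.NO_EFFECT


rib = {"192.0.2.0/24": "path-A"}
lost_withdrawal = effect_of_lost_nlri(rib, "192.0.2.0/24", None)
lost_change = effect_of_lost_nlri(rib, "192.0.2.0/24", "path-B")
lost_new = effect_of_lost_nlri(rib, "198.51.100.0/24", "path-C")
```

The receiver cannot run this function itself, since it never learns the prefix or the intended change; the sketch only spells out what happens behind its back.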
For completeness, if all forms and degrees of attribute broken-ness are required to be acceptable, then there are cases where the receiver cannot know whether *all* NLRI have been extracted, or not.

> Is it acceptable that we leave this as guidance within the
> requirements? If not, please could you suggest how the definitions
> of Critical/Non-Critical could be altered to address your concerns?

Without a change at the sender end, the problem is that:

  "Once you have a malformed update, NOTHING is certain."

So, as I said earlier, the fundamental question is:

  (F) is it OK to continue with a session after processing a message which may have contained some NLRI which could not be extracted ?

As you say, the question hinges on the "which may have contained". As above, it seems to me that the draft skates over the issue of "lost routes".

It also hinges on the operational impact of such "lost routes", on which I am unqualified to pronounce.

It seems to me there is a range of possibilities:

  1) at one end of the spectrum, if "lost routes" are to be avoided at all costs, then the rules for parsing attributes need to be tight and without a change at the sender end, what is achievable is constrained.

  2) if "lost routes" are acceptable in the "theoretical" cases, but you-know-and-I-know that most of the time the case will not arise...

     ...or "lost routes" are acceptable provided some defined steps are taken to identify as much NLRI as is reasonably possible...

     ...then the requirements need to (a) make the case and (b) describe the trade-off (so that designs made to follow the requirements can be judged in those terms).

  3) at the other end of the spectrum, if "lost routes" are always preferable to session-reset, then this opens up an even looser approach to UPDATE message handling, in which practically nothing need cause a session-reset -- ie, practically nothing is a Critical Error.
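
One way to read the range of possibilities 1) to 3) is as a single tolerance setting that decides what a receiver does with a broken UPDATE. The Python sketch below is only an interpretation under stated assumptions: the names Tolerance and choose_action are hypothetical, and the "discard-update" outcome for the loosest setting is an assumption, since the thread says only that practically nothing need cause a session-reset.

```python
from enum import Enum


class Tolerance(Enum):
    AVOID_AT_ALL_COSTS = 1  # possibility 1: "lost routes" never acceptable
    ACCEPT_WITH_STEPS = 2   # possibility 2: acceptable after defined steps
    ALWAYS_PREFER = 3       # possibility 3: always better than session-reset


def choose_action(tolerance, nlri_found, may_have_lost_nlri):
    """Pick a response to a malformed UPDATE.

    nlri_found         -- at least some NLRI could be extracted
    may_have_lost_nlri -- the receiver cannot rule out unextracted NLRI
    """
    if tolerance is Tolerance.AVOID_AT_ALL_COSTS:
        if nlri_found and not may_have_lost_nlri:
            return "treat-as-withdraw"
        return "session-reset"
    if tolerance is Tolerance.ACCEPT_WITH_STEPS:
        return "treat-as-withdraw" if nlri_found else "session-reset"
    # ALWAYS_PREFER: never reset; "discard-update" is an assumed outcome
    return "treat-as-withdraw" if nlri_found else "discard-update"


strict = choose_action(Tolerance.AVOID_AT_ALL_COSTS, True, True)
middle = choose_action(Tolerance.ACCEPT_WITH_STEPS, True, True)
loose = choose_action(Tolerance.ALWAYS_PREFER, False, True)
```

The same message (some NLRI found, others possibly lost) yields a session-reset under the first setting and treat-as-withdraw under the second, which is the trade-off the requirements would need to describe.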
And, as I have suggested in previous discussions, it is possible to consider case (2) in two parts:

  a) where the sender is unchanged.

     In which case the issue is the degree of attribute broken-ness vs the likelihood of "lost routes".

     If this is viewed as an intermediate step, then perhaps the requirements could err on the side of safety ? Perhaps this can be addressed by a "knob" allowing more or less strictness in the parsing of attributes ?

  b) where the sender is changed.

     In which case the issue ought to be moot, because the receiver will then be able to know that there are no "lost routes".

> I would also appreciate further input from IDR as to whether this is
> sufficient requirement from GROW to allow a solution document to be
> written?

The new requirements specify only Critical (session-reset) and Non-Critical (treat-as-withdraw) Errors. This appears to rule out the "attribute discard" option in draft-ietf-idr-error-handling-03. Is that intended ?

Chris

_______________________________________________
Idr mailing list
[email protected]<mailto:[email protected]>
https://www.ietf.org/mailman/listinfo/idr
