Hi Brian,

We discussed this at some length previously (the -00 of the requirements draft discussed automatically triggering some form of means to recover from the error). I think that there is some merit in doing so, but there are clearly also scaling considerations around this (in terms of requiring both BGP speakers to re-generate/re-process UPDATEs).
The consensus that was reached in the previous discussion was that this is not necessarily going to be of advantage in terms of causing a different UPDATE message to be generated. Particularly, since a ROUTE REFRESH would result in the same UPDATE packing (generally).

To that end, what the requirements draft currently says is that:

- ROUTE-REFRESH is a reasonable way to re-run a consistency check between two peers (especially with start-of-refresh and end-of-refresh markers, since this means that we can purge any NLRI that we missed being withdrawn).

- If there is anything automatically triggered, it would be prudent to try and ensure that we can make as specific a request to the other speaker as possible - this is where mechanisms such as One-Time ORF were suggested, and is a similar space to that which would be solved through having the UPDATE-VERSION message described in draft-ietf-idr-enhanced-gr if one were to be able to have a REFRESH following a "last known good" position. Making a more specific request may result in a different UPDATE being generated, which may mean the error is fixed, but this is not guaranteed.

- If the selected best path changes on the remote speaker, then it will be re-advertised anyway - resulting in another chance to re-parse these NLRI.

So, certainly, if we relax the "correctness" requirement, then it would seem a good idea to have some means to be able to consistency check -- it's just a matter of ensuring that this is done in an effective and scalable manner, that does not affect normal BGP operation.

Cheers,
r.

On 28 Dec 2012, at 19:02, Brian Dickson <[email protected]> wrote:

> Here's a quick question, about possible mechanisms for determining whether or
> not we need to tear down a session.
>
> If there is a chance that some NLRI weren't properly decoded:
>
> - What about requesting (presuming the option was negotiated) a
> route-refresh, or a "confirm per-AFI-SAFI prefix list"?
>
> It's an expensive "parity check" but is one way of ensuring that we haven't
> missed a withdrawn prefix.
> If it isn't in the subsequently received list of prefixes, we missed it and
> should withdraw it.
> Everything else is a no-op - maybe we missed a new prefix (with new
> attribute?), but that is what "treat as withdraw" gets you.
>
> Just trying to make sure the cure isn't worse than the disease, in all
> situations.
> (Where "cure" is fail to withdraw because we "lost" the withdrawal because of
> malformed update, or "cure" is tear down the session.)
>
> Brian
>
> On Fri, Dec 28, 2012 at 1:04 PM, Chris Hall <[email protected]> wrote:
> Rob Shakir wrote (on Fri 28-Dec-2012 at 13:10 +0000):
> >
> > (re: CCing IDR & GROW)
> >
> > On 28 Dec 2012, at 12:28, Chris Hall wrote:
> >
> > > Rob Shakir wrote (on Thu 27-Dec-2012 at 18:44):
> > >>
> > >> Any comments very welcome (to me or grow@).
>
> > > I'm afraid I still don't get it :-( What am I missing ?
> > >
> > > UPDATE Message Length errors are Critical because they:
> > >
> > >   (1) "result in cases whereby the NLRI attribute cannot
> > >       be correctly extracted".
> > >
> > > The implication is that a failure to extract all NLRI is Critical.
> > > Is that a requirement ?
>
> > If the NLRI cannot be determined, then this is a Critical error,
> > yes. I left the wording relatively open on whether this is *all*
> > NLRI, ...
>
> OK. So, if the message is broken (in any way):
>
>   * if no NLRI can be found, that's a Critical Error.
>
>     That would include the case where some broken
>     attribute upstream of any MP_XXX obscures that
>     MP_XXX.
>
>   * if some NLRI can be found, that's a Non-Critical error.
>
>     Such NLRI as can be found are treated-as-withdraw.
>
>     AND any NLRI that may or may not have been lost, are
>     ignored.
>
>     AND whatever dangers lost NLRI may pose are covered
>     by the caveats in 4.1.
>
> Yes ?
>
> > ...
> > as I am not sure that in the requirements draft we should
> > specify direct solutions to specific issues, to e.g., say how to
> > handle cases where MP_REACH_NLRI and MP_UNREACH_NLRI are in the same
> > message [this is a case that I do not believe is forbidden by
> > rfc2858 - if the working group could clarify whether this is
> > something that we feel the draft needs to handle or can explicitly
> > be omitted, then that would be appreciated].
>
> The RFCs also allow any mix of IPv4 Unicast Reachable/Unreachable in
> the body of the message, with or without MP_REACH_NLRI and/or
> MP_UNREACH_NLRI. If we call each of these a "collection of NLRI",
> then there may be between 1 and 4 collections of NLRI in a message
> (ignoring the End-of-RIB pseudo-UPDATE). It is an error to have more
> than one MP_REACH_NLRI attribute or more than one MP_UNREACH_NLRI
> attribute. It is not clear to me whether those should be Critical
> Errors -- but while we are relaxing all error checking, I don't see
> why they should be.
>
> It is possible that most implementations actually only send one
> collection per message... so, as you suggest, on a you-know-and-I-know
> basis, we'all could ignore the "theoretical" issues of more than one
> collection in a message.
>
> But, this does not matter if the requirements allow lost NLRI to
> simply be ignored. If the sender sends any MP_XXX first, then there
> is little chance that NLRI will be lost, and things work better. If
> the sender follows the old rules, and particularly if it sends more
> than one collection at a time, then there is more chance that NLRI
> will be lost.
>
> > > Later:
> > >
> > >   (2) "All errors whereby the contained NLRI can be
> > >       extracted are referred to as Non-Critical".
> > >
> > > And that includes:
> > >
> > >   (3) "where the length of all path attributes contained
> > >       within the UPDATE does not correspond to the
> > >       total path attribute length."
> > >
> > > That is, at least, more explicit than
> > > draft-ietf-idr-error-handling-03, which glosses over (3).
> > >
> > > But if (3) is non-critical then there is some chance that
> > > some NLRI will not be extracted, which appears to violate
> > > (1) and (2).
>
> > Disclaimer: As I am sure that my comments previously have made
> > clear, I do not maintain a code base for a BGP daemon/implementation
> > - so please feel free to correct my logic below.
> >
> > I do not believe that (3) implies that the NLRI cannot be correctly
> > found.
>
> What (3) implies to me is that somewhere amongst the attributes
> something is broken. As Jakob Heitz succinctly puts it: "Once you
> have a malformed update, NOTHING is certain." In particular, you
> cannot be certain that all NLRI referred to by the sender can be found
> by the receiver.
>
> > If the sum of total length is incorrect, then we can still
> > extract the individual attributes - we just find that there is not
> > enough data to fill the overall length we were told and/or we have
> > too much attribute data compared to the total attribute length.
>
> This is precisely where I think the requirements come unstuck. There
> is almost no redundancy in the attribute encoding. And almost all the
> redundancy there is is discarded by the new-form-error-handling. So,
> if the sum of the attribute lengths is incorrect, then (inter alia)
> the receiver CANNOT know whether the attributes it has extracted are
> the attributes the sender intended to send. In particular, the
> receiver cannot know it has extracted all NLRI (except in the,
> probably obscure, case of having extracted both MP_REACH_NLRI and
> MP_UNREACH_NLRI). "Once you have a malformed update, NOTHING is
> certain."
>
> Conversely, if the sum of attribute lengths is correct, then there is
> a fighting chance that the attributes received are the attributes
> sent.
> But, if one of those attributes is (say) nominally an
> ATOMIC_AGGREGATE which is apparently 200 octets long, that might be a
> worry.
>
> ...
> > > Then (4) "In order to maximise the number of cases whereby the
> > > NLRI attributes [plural, now, BTW] can be reliably extracted
> > > from a received message...". Ah. So it is not a Critical
> > > Error if "the NLRI attribute cannot be correctly extracted".
>
> > No - it is a Critical error if we cannot extract the NLRI. This
> > recommendation is to give an increased chance that the NLRI can be
> > extracted as per the IDR error handling draft. This then (by virtue
> > of resulting in the NLRI being extracted) minimises the number of
> > cases that result in a Critical error.
>
> By this do you mean "all" NLRI or only "some" NLRI ? As above.
>
> > The plural here is to reflect
> > that the existence of >1 type of NLRI attribute.
>
> That would imply that it should be "NLRI attributes" throughout ?
> Though in most places where the draft speaks of "NLRI attribute" I
> think it means one, some, any or all of the "collections of NLRI" (as
> defined above).
>
> > > For me the requirement remains "conflicted". On the one hand it
> > > seems to say that it is a Critical Error if the NLRI cannot be
> > > extracted and parsed. On the other it seems to say it's OK if
> > > you cannot extract some NLRI.
>
> > If you'll forgive me for removing a significant proportion of your
> > message, I think that we need to take another step back here. It
> > seems to me that the key question that you are highlighting is "What
> > level of confidence do we need to have before we declare that the
> > NLRI cannot be extracted?" -- do you agree?
>
> Absolutely.
>
> > From an operator perspective, I would like to compromise *certainty*
> > for *robustness*.
> > You are right, we are compromising correctness
> > here, we might end up withdrawing an incorrect NLRI and impacting
> > service operation for that prefix - however, it is somewhat
> > preferable to me to withdraw a subset of the NLRI incorrectly,
> > rather than impact all NLRI in one single action. We clearly need to
> > provide some bounds on how much we compromise the certainty (and
> > live within the realms of possibility, such that we are not just
> > taking a shot in the dark). This is what the definitions of Critical
> > and Non-Critical within the document are intended to provide. Once
> > again I will refer to the requirement that there is a balance
> > between correctness and robustness - rather than a locally risk
> > averse approach that results in harmful wider behaviour.
>
> Where one can extract the NLRI from a broken message, then
> treat-as-withdraw is AFAICS no worse for the NLRI in question than
> session-reset, and hugely better for all the other NLRI received from
> the peer.
>
> Where not *all* the NLRI in a message have been extracted, then we
> take a step beyond withdrawing something which should not have been
> withdrawn. For each of those NLRI the receiver will (unknowingly) do
> one of:
>
>   (a) continue using a route which the sender has
>       withdrawn;
>
>   (b) continue to use an out of date version of a route
>       which the sender has changed in some way, possibly
>       materially;
>
>   (c) fail to use a new route the sender has now made
>       available.
>
> Of these (a) looks serious and (b) could be, but certainly the effect
> is different from session-reset; while (c) is, essentially,
> treat-as-withdraw. If some risk of "lost routes" is taken, then the
> impact of that clearly must be weighed against the impact of
> session-reset. I have not found a discussion of the impact of "lost
> routes" in the draft, so I have no idea what the trade-off is, here.
>
> For completeness, if all forms and degrees of attribute broken-ness
> are required to be acceptable, then there are cases where the receiver
> cannot know whether *all* NLRI have been extracted, or not.
>
> > Is it acceptable that we leave this as guidance within the
> > requirements? If not, please could you suggest how the definitions
> > of Critical/Non-Critical could be altered to address your concerns?
>
> Without a change at the sender end, the problem is that: "Once you
> have a malformed update, NOTHING is certain."
>
> So, as I said earlier, the fundamental question is:
>
>   (F) is it OK to continue with a session after processing
>       a message which may have contained some NLRI which
>       could not be extracted ?
>
> As you say, the question hinges on the "which may have contained". As
> above, it seems to me that the draft skates over the issue of "lost
> routes".
>
> It also hinges on the operational impact of such "lost routes", on
> which I am unqualified to pronounce.
>
> It seems to me there is a range of possibilities:
>
>   1) at one end of the spectrum, if "lost routes" are to
>      be avoided at all costs, then the rules for parsing
>      attributes need to be tight and without a change at
>      the sender end, what is achievable is constrained.
>
>   2) if "lost routes" are acceptable in the "theoretical"
>      cases, but you-know-and-I-know that most of the
>      time the case will not arise...
>
>      ...or "lost routes" are acceptable provided some
>      defined steps are taken to identify as much NLRI
>      as is reasonably possible...
>
>      ...then the requirements need to (a) make the case
>      and (b) describe the trade-off (so that designs
>      made to follow the requirements can be judged in
>      those terms).
>
>   3) at the other end of the spectrum, if "lost routes"
>      are always preferable to session-reset, then this
>      opens up an even looser approach to UPDATE message
>      handling, in which practically nothing need cause
>      a session-reset -- ie, practically nothing is a
>      Critical Error.
>
> And, as I have suggested in previous discussions, it is possible to
> consider case (2) in two parts:
>
>   a) where the sender is unchanged.
>
>      In which case the issue is the degree of attribute
>      broken-ness vs the likelihood of "lost routes".
>
>      If this is viewed as an intermediate step, then
>      perhaps the requirements could err on the side of
>      safety ?
>
>      Perhaps this can be addressed by a "knob" allowing
>      more or less strictness in the parsing of
>      attributes ?
>
>   b) where the sender is changed.
>
>      In which case the issue ought to be moot, because
>      the receiver will then be able to know that there
>      are no "lost routes".
>
> > I would also appreciate further input from IDR as to whether this is
> > sufficient requirement from GROW to allow a solution document to be
> > written?
>
> The new requirements specify only Critical (session-reset) and
> Non-Critical (treat-as-withdraw) Errors. This appears to rule out the
> "attribute discard" option in draft-ietf-idr-error-handling-03. Is
> that intended ?
>
> Chris
>
> _______________________________________________
> Idr mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/idr
>
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
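
For illustration only, the ROUTE-REFRESH "parity check" discussed in this thread (refresh bracketed by start-of-refresh and end-of-refresh markers, then purge anything that was not re-advertised) might be sketched roughly as below. This is a minimal sketch and not taken from any draft or implementation: the `AdjRibIn` class, its method names, and the use of plain strings for prefixes are all assumptions made for the example, and it models a single peer and a single AFI/SAFI.

```python
class AdjRibIn:
    """Hypothetical per-peer, per-AFI/SAFI table of received routes."""

    def __init__(self):
        self.routes = {}    # prefix -> path attributes (opaque here)
        self._seen = None   # prefixes re-received during a refresh, or None

    def start_of_refresh(self):
        # Peer signalled the start of a refresh: begin tracking what we see.
        self._seen = set()

    def update(self, prefix, attrs):
        # A cleanly parsed advertisement for this prefix.
        self.routes[prefix] = attrs
        if self._seen is not None:
            self._seen.add(prefix)

    def withdraw(self, prefix):
        # Explicit withdraw, or treat-as-withdraw after a malformed UPDATE.
        self.routes.pop(prefix, None)
        if self._seen is not None:
            self._seen.discard(prefix)

    def end_of_refresh(self):
        # Anything we still hold that the peer did not re-advertise is
        # assumed to be a withdrawal we missed, so purge it.
        if self._seen is None:
            return set()
        stale = set(self.routes) - self._seen
        for prefix in stale:
            del self.routes[prefix]
        self._seen = None
        return stale


# Example: the withdrawal of 198.51.100.0/24 was lost in a malformed UPDATE.
rib = AdjRibIn()
rib.update("192.0.2.0/24", {"next_hop": "203.0.113.1"})
rib.update("198.51.100.0/24", {"next_hop": "203.0.113.1"})

rib.start_of_refresh()
rib.update("192.0.2.0/24", {"next_hop": "203.0.113.1"})
purged = rib.end_of_refresh()
print(sorted(purged))
```

As noted in the thread, this only catches missed withdrawals; a refresh that is packed into the same malformed UPDATE would not necessarily recover a route that failed to parse the first time.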
