Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Rob Shakir Sat, 29 Dec 2012 16:05:38 -0800

Hi Brian,

We discussed this at some length previously (the -00 of the requirements draft 
discussed automatically triggering some means of recovering from the error). 
I think that there is some merit in doing so, but there are clearly also 
scaling considerations around this (in terms of requiring both BGP speakers 
to re-generate/re-process UPDATEs).


The consensus reached in the previous discussion was that this will not 
necessarily help by causing a different UPDATE message to be generated, 
particularly since a ROUTE-REFRESH would (generally) result in the same 
UPDATE packing. To that end, what the requirements draft currently says is 
that:

- ROUTE-REFRESH is a reasonable way to re-run a consistency check between two 
peers (especially with start-of-refresh and end-of-refresh markers, since this 
means that we can purge any NLRI that we missed being withdrawn).
- If anything is triggered automatically, it would be prudent to make the 
request to the other speaker as specific as possible. This is where 
mechanisms such as One-Time ORF were suggested; it is a similar problem space 
to the one addressed by the UPDATE-VERSION message described in 
draft-ietf-idr-enhanced-gr, if one were able to request a REFRESH from a 
"last known good" position. Making a more specific request may result in a 
different UPDATE being generated, which may mean the error is fixed, but this 
is not guaranteed.
- If the selected best path changes on the remote speaker, then it will be 
re-advertised anyway - resulting in another chance to re-parse these NLRI.
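
As a rough illustration of the first bullet (this is not something the draft 
specifies), here is a minimal Python sketch of how start-of-refresh and 
end-of-refresh markers let the receiver purge NLRI whose withdrawal it missed. 
The class and method names (AdjRibIn, start_of_refresh, end_of_refresh) and 
the dict-based table are assumptions made up for this example, not taken from 
any implementation.

```python
class AdjRibIn:
    """Toy per-peer route table keyed by prefix string (hypothetical)."""

    def __init__(self):
        self.routes = {}    # prefix -> attributes
        self.stale = set()  # prefixes not yet re-confirmed during a refresh

    def update(self, prefix, attrs):
        # A (re-)advertisement installs the route and confirms the prefix.
        self.routes[prefix] = attrs
        self.stale.discard(prefix)

    def withdraw(self, prefix):
        self.routes.pop(prefix, None)
        self.stale.discard(prefix)

    def start_of_refresh(self):
        # Everything currently held is suspect until the peer re-sends it.
        self.stale = set(self.routes)

    def end_of_refresh(self):
        # Anything the peer did not re-send is a withdrawal we missed.
        purged = sorted(self.stale)
        for prefix in purged:
            del self.routes[prefix]
        self.stale.clear()
        return purged


rib = AdjRibIn()
rib.update("192.0.2.0/24", {"next_hop": "198.51.100.1"})
rib.update("203.0.113.0/24", {"next_hop": "198.51.100.1"})

# Assume the withdrawal of 203.0.113.0/24 was lost in a malformed UPDATE,
# so the peer only re-sends 192.0.2.0/24 during the refresh.
rib.start_of_refresh()
rib.update("192.0.2.0/24", {"next_hop": "198.51.100.1"})
purged = rib.end_of_refresh()
print(purged)
```

Note that the sketch only covers the "missed withdraw" case. It says nothing 
about the cost of the peer re-sending its whole table, which is the scaling 
concern above.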

So, certainly, if we relax the "correctness" requirement, then it would seem a 
good idea to have some means of running a consistency check -- it's just a 
matter of ensuring that this is done in an effective and scalable manner that 
does not affect normal BGP operation.

Cheers,
r.


On 28 Dec 2012, at 19:02, Brian Dickson <[email protected]> wrote:

> Here's a quick question, about possible mechanisms for determining whether or 
> not we need to tear down a session.
> 
> If there is a chance that some NLRI weren't properly decoded:
> 
> - What about requesting (presuming the option was negotiated) a 
> route-refresh, or a "confirm per-AFI-SAFI prefix list"?
> 
> It's an expensive "parity check" but is one way of ensuring that we haven't 
> missed a withdrawn prefix.
> If it isn't in the subsequently received list of prefixes, we missed it and 
> should withdraw it.
> Everything else is a no-op - maybe we missed a new prefix (with new 
> attribute?), but that is what "treat as withdraw" gets you.
> 
> Just trying to make sure the cure isn't worse than the disease, in all 
> situations.
> (Where "cure" is fail to withdraw because we "lost" the withdrawal because of 
> malformed update, or "cure" is tear down the session.)
> 
> Brian
> 
> On Fri, Dec 28, 2012 at 1:04 PM, Chris Hall <[email protected]> wrote:
> Rob Shakir wrote (on Fri 28-Dec-2012 at 13:10 +0000):
> >
> > (re: CCing IDR & GROW)
> >
> > On 28 Dec 2012, at 12:28, Chris Hall wrote:
> >
> > > Rob Shakir wrote (on Thu 27-Dec-2012 at 18:44):
> > >>
> > >> Any comments very welcome (to me or grow@).
> 
> > > I'm afraid I still don't get it :-(  What am I missing ?
> > >
> > > UPDATE Message Length errors are Critical because they:
> > >
> > > (1) "result in cases whereby the NLRI attribute cannot
> > >      be correctly extracted".
> > >
> > > The implication is that a failure to extract all NLRI is Critical.
> > > Is that a requirement ?
> 
> > If the NLRI cannot be determined, then this is a Critical error,
> > yes. I left the wording relatively open on whether this is *all*
> > NLRI, ...
> 
> OK.  So, if the message is broken (in any way):
> 
>   * if no NLRI can be found, that's a Critical Error.
> 
>     That would include the case where some broken
>     attribute upstream of any MP_XXX obscures that
>     MP_XXX.
> 
>   * if some NLRI can be found, that's a Non-Critical error.
> 
>     Such NLRI as can be found are treated-as-withdraw.
> 
>     AND any NLRI that may or may not have been lost, are
>     ignored.
> 
>     AND whatever dangers lost NLRI may pose are covered
>     by the caveats in 4.1.
> 
> Yes ?
> 
> > ... as I am not sure that in the requirements draft we should
> > specify direct solutions to specific issues, to e.g., say how to
> > handle cases where MP_REACH_NLRI and MP_UNREACH_NLRI are in the same
> > message [this is a case that I do not believe is forbidden by
> > rfc2858 - if the working group could clarify whether this is
> > something that we feel the draft needs to handle or can explicitly
> > be omitted, then that would be appreciated].
> 
> The RFCs also allow any mix of IPv4 Unicast Reachable/Unreachable in
> the body of the message, with or without MP_REACH_NLRI and/or
> MP_UNREACH_NLRI.  If we call each of these a "collection of NLRI",
> then there may be between 1 and 4 collections of NLRI in a message
> (ignoring the End-of-RIB pseudo-UPDATE).  It is an error to have more
> than one MP_REACH_NLRI attribute or more than one MP_UNREACH_NLRI
> attribute.  It is not clear to me whether those should be Critical
> Errors -- but while we are relaxing all error checking, I don't see
> why they should be.
> 
> It is possible that most implementations actually only send one
> collection per message... so, as you suggest, on a you-know-and-I-know
> basis, we all could ignore the "theoretical" issues of more than one
> collection in a message.
> 
> But, this does not matter if the requirements allow lost NLRI to
> simply be ignored.  If the sender sends any MP_XXX first, then there
> is little chance that NLRI will be lost, and things work better.  If
> the sender follows the old rules, and particularly if it sends more
> than one collection at a time, then there is more chance that NLRI
> will be lost.
> 
> > > Later:
> > >
> > >  (2) "All errors whereby the contained NLRI can be
> > >       extracted are referred to as Non-Critical".
> > >
> > > And that includes:
> > >
> > >  (3) "where the length of all path attributes contained
> > >       within the UPDATE does not correspond to the
> > >       total path attribute length."
> > >
> > > That is, at least, more explicit than
> > > draft-ietf-idr-error-handling-03, which glosses over (3).
> > >
> > > But if (3) is non-critical then there is some chance that
> > > some NLRI will not be extracted, which appears to violate
> > > (1) and (2).
> 
> > Disclaimer: As I am sure that my comments previously have made
> > clear, I do not maintain a code base for a BGP daemon/implementation
> > - so please feel free to correct my logic below.
> >
> > I do not believe that (3) implies that the NLRI cannot be correctly
> > found.
> 
> What (3) implies to me is that somewhere amongst the attributes
> something is broken.  As Jakob Heitz succinctly puts it: "Once you
> have a malformed update, NOTHING is certain."  In particular, you
> cannot be certain that all NLRI referred to by the sender can be found
> by the receiver.
> 
> > If the sum of total length is incorrect, then we can still
> > extract the individual attributes - we just find that there is not
> > enough data to fill the overall length we were told and/or we have
> > too much attribute data compared to the total attribute length.
> 
> This is precisely where I think the requirements come unstuck.  There
> is almost no redundancy in the attribute encoding.  And almost all the
> redundancy there is is discarded by the new-form-error-handling.  So,
> if the sum of the attribute lengths is incorrect, then (inter alia)
> the receiver CANNOT know whether the attributes it has extracted are
> the attributes the sender intended to send.  In particular, the
> receiver cannot know it has extracted all NLRI (except in the,
> probably obscure, case of having extracted both MP_REACH_NLRI and
> MP_UNREACH_NLRI).  "Once you have a malformed update, NOTHING is
> certain."
> 
> Conversely, if the sum of attribute lengths is correct, then there is
> a fighting chance that the attributes received are the attributes
> sent.  But, if one of those attributes is (say) nominally an
> ATOMIC_AGGREGATE which is apparently 200 octets long, that might be a
> worry.
> 
> ...
> > > Then (4) "In order to maximise the number of cases whereby the
> > > NLRI attributes [plural, now, BTW] can be reliably extracted
> > > from a received message...".  Ah.  So it is not a Critical
> > > Error if "the NLRI attribute cannot be correctly extracted".
> 
> > No - it is a Critical error if we cannot extract the NLRI. This
> > recommendation is to give an increased chance that the NLRI can be
> > extracted as per the IDR error handling draft. This then (by virtue
> > of resulting in the NLRI being extracted) minimises the number of
> > cases that result in a Critical error.
> 
> By this do you mean "all" NLRI or only "some" NLRI ?  As above.
> 
> > The plural here is to reflect
> > the existence of >1 type of NLRI attribute.
> 
> That would imply that it should be "NLRI attributes" throughout ?
> Though in most places where the draft speaks of "NLRI attribute" I
> think it means one, some, any or all of the "collections of NLRI" (as
> defined above).
> 
> > > For me the requirement remains "conflicted".  On the one hand it
> > > seems to say that it is a Critical Error if the NLRI cannot be
> > > extracted and parsed.  On the other it seems to say it's OK if
> > > you cannot extract some NLRI.
> 
> > If you'll forgive me for removing a significant proportion of your
> > message, I think that we need to take another step back here. It
> > seems to me that the key question that you are highlighting is "What
> > level of confidence do we need to have before we declare that the
> > NLRI cannot be extracted?" -- do you agree?
> 
> Absolutely.
> 
> > From an operator perspective, I would like to compromise *certainty*
> > for *robustness*. You are right, we are compromising correctness
> > here, we might end up withdrawing an incorrect NLRI and impacting
> > service operation for that prefix - however, it is somewhat
> > preferable to me to withdraw a subset of the NLRI incorrectly,
> > rather than impact all NLRI in one single action. We clearly need to
> > provide some bounds on how much we compromise the certainty (and
> > live within the realms of possibility, such that we are not just
> > taking a shot in the dark). This is what the definitions of Critical
> > and Non-Critical within the document are intended to provide. Once
> > again I will refer to the requirement that there is a balance
> > between correctness and robustness - rather than a locally risk
> > averse approach that results in harmful wider behaviour.
> 
> Where one can extract the NLRI from a broken message, then
> treat-as-withdraw is AFAICS no worse for the NLRI in question than
> session-reset, and hugely better for all the other NLRI received from
> the peer.
> 
> Where not *all* the NLRI in a message have been extracted, then we
> take a step beyond withdrawing something which should not have been
> withdrawn.  For each of those NLRI the receiver will (unknowingly) do
> one of:
> 
>   (a) continue using a route which the sender has
>       withdrawn;
> 
>   (b) continue to use an out of date version of a route
>       which the sender has changed in some way, possibly
>       materially;
> 
>   (c) fail to use a new route the sender has now made
>       available.
> 
> Of these (a) looks serious and (b) could be, but certainly the effect
> is different from session-reset; while (c) is, essentially,
> treat-as-withdraw.  If some risk of "lost routes" is taken, then the
> impact of that clearly must be weighed against the impact of
> session-reset.  I have not found a discussion of the impact of "lost
> routes" in the draft, so I have no idea what the trade-off is, here.
> 
> For completeness, if all forms and degrees of attribute broken-ness
> are required to be acceptable, then there are cases where the receiver
> cannot know whether *all* NLRI have been extracted, or not.
> 
> > Is it acceptable that we leave this as guidance within the
> > requirements? If not, please could you suggest how the definitions
> > of Critical/Non-Critical could be altered to address your concerns?
> 
> Without a change at the sender end, the problem is that: "Once you
> have a malformed update, NOTHING is certain."
> 
> So, as I said earlier, the fundamental question is:
> 
>   (F) is it OK to continue with a session after processing
>       a message which may have contained some NLRI which
>       could not be extracted ?
> 
> As you say, the question hinges on the "which may have contained".  As
> above, it seems to me that the draft skates over the issue of "lost
> routes".
> 
> It also hinges on the operational impact of such "lost routes", on
> which I am unqualified to pronounce.
> 
> It seems to me there is a range of possibilities:
> 
>   1) at one end of the spectrum, if "lost routes" are to
>      be avoided at all costs, then the rules for parsing
>      attributes need to be tight and without a change at
>      the sender end, what is achievable is constrained.
> 
>   2) if "lost routes" are acceptable in the "theoretical"
>      cases, but you-know-and-I-know that most of the
>      time the case will not arise...
> 
>      ...or "lost routes" are acceptable provided some
>      defined steps are taken to identify as much NLRI
>      as is reasonably possible...
> 
>      ...then the requirements need to (a) make the case
>      and (b) describe the trade-off (so that designs
>      made to follow the requirements can be judged in
>      those terms).
> 
>   3) at the other end of the spectrum, if "lost routes"
>      are always preferable to session-reset, then this
>      opens up an even looser approach to UPDATE message
>      handling, in which practically nothing need cause
>      a session-reset -- ie, practically nothing is a
>      Critical Error.
> 
> And, as I have suggested in previous discussions, it is possible to
> consider case (2) in two parts:
> 
>   a) where the sender is unchanged.
> 
>      In which case the issue is the degree of attribute
>      broken-ness vs the likelihood of "lost routes".
> 
>      If this is viewed as an intermediate step, then
>      perhaps the requirements could err on the side of
>      safety ?
> 
>      Perhaps this can be addressed by a "knob" allowing
>      more or less strictness in the parsing of
>      attributes ?
> 
>   b) where the sender is changed.
> 
>      In which case the issue ought to be moot, because
>      the receiver will then be able to know that there
>      are no "lost routes".
> 
> > I would also appreciate further input from IDR as to whether this is
> > sufficient requirement from GROW to allow a solution document to be
> > written?
> 
> The new requirements specify only Critical (session-reset) and
> Non-Critical (treat-as-withdraw) Errors.  This appears to rule out the
> "attribute discard" option in draft-ietf-idr-error-handling-03.  Is
> that intended ?
> 
> Chris
> 
> _______________________________________________
> Idr mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/idr
>

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
