Re: [GROW] [Idr] Fwd: I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Brian Dickson Fri, 28 Dec 2012 11:02:31 -0800

Here's a quick question, about possible mechanisms for determining whether
or not we need to tear down a session.


If there is a chance that some NLRI weren't properly decoded:

- What about requesting (presuming the option was negotiated) a
route-refresh, or a "confirm per-AFI-SAFI prefix list"?

It's an expensive "parity check" but is one way of ensuring that we haven't
missed a withdrawn prefix.
If a prefix we hold isn't in the subsequently received list of prefixes, we
missed its withdrawal and should withdraw it.
Everything else is a no-op - maybe we missed a new prefix (with new
attribute?), but that is what "treat as withdraw" gets you.
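
To make the "parity check" concrete, here is a rough sketch of the
reconciliation step. It is an illustration only: `reconcile`, `held` and
`confirmed` are hypothetical names, prefixes are plain strings, and
`confirmed` stands in for whatever one AFI/SAFI's route-refresh (or the
hypothetical "confirm prefix list") would hand back.

```python
def reconcile(held, confirmed):
    """Compare the prefixes we hold from a peer against the list the peer
    re-announces for one AFI/SAFI after the refresh.

    Returns (kept, withdrawn):
      kept      - held prefixes the peer still announces
      withdrawn - held prefixes missing from the refresh, i.e. withdrawals
                  we presumably lost in the malformed UPDATE
    Prefixes the peer announces but we do not hold are deliberately left
    alone (the "no-op" case); this sketch does not install them.
    """
    held = set(held)
    confirmed = set(confirmed)
    withdrawn = held - confirmed
    kept = held & confirmed
    return kept, withdrawn


# Example: we missed the withdrawal of 198.51.100.0/24.
held = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}
confirmed = {"192.0.2.0/24", "203.0.113.0/24", "2001:db8::/32"}
kept, withdrawn = reconcile(held, confirmed)
```

Here `withdrawn` ends up holding only 198.51.100.0/24, and the prefix we
never had (2001:db8::/32) is not touched.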

Just trying to make sure the cure isn't worse than the disease, in all
situations.
(Where the "cure" is either failing to withdraw, because we "lost" the
withdrawal in a malformed update, or tearing down the session.)

Brian

On Fri, Dec 28, 2012 at 1:04 PM, Chris Hall <[email protected]> wrote:

> Rob Shakir wrote (on Fri 28-Dec-2012 at 13:10 +0000):
> >
> > (re: CCing IDR & GROW)
> >
> > On 28 Dec 2012, at 12:28, Chris Hall wrote:
> >
> > > Rob Shakir wrote (on Thu 27-Dec-2012 at 18:44):
> > >>
> > >> Any comments very welcome (to me or grow@).
>
> > > I'm afraid I still don't get it :-(  What am I missing ?
> > >
> > > UPDATE Message Length errors are Critical because they:
> > >
> > > (1) "result in cases whereby the NLRI attribute cannot
> > >      be correctly extracted".
> > >
> > > The implication is that a failure to extract all NLRI is Critical.
> > > Is that a requirement ?
>
> > If the NLRI cannot be determined, then this is a Critical error,
> > yes. I left the wording relatively open on whether this is *all*
> > NLRI, ...
>
> OK.  So, if the message is broken (in any way):
>
>   * if no NLRI can be found, that's a Critical Error.
>
>     That would include the case where some broken
>     attribute upstream of any MP_XXX obscures that
>     MP_XXX.
>
>   * if some NLRI can be found, that's a Non-Critical error.
>
>     Such NLRI as can be found are treated-as-withdraw.
>
>     AND any NLRI that may or may not have been lost, are
>     ignored.
>
>     AND whatever dangers lost NLRI may pose are covered
>     by the caveats in 4.1.
>
> Yes ?
>
> > ... as I am not sure that in the requirements draft we should
> > specify direct solutions to specific issues, to e.g., say how to
> > handle cases where MP_REACH_NLRI and MP_UNREACH_NLRI are in the same
> > message [this is a case that I do not believe is forbidden by
> > rfc2858 - if the working group could clarify whether this is
> > something that we feel the draft needs to handle or can explicitly
> > be omitted, then that would be appreciated].
>
> The RFCs also allow any mix of IPv4 Unicast Reachable/Unreachable in
> the body of the message, with or without MP_REACH_NLRI and/or
> MP_UNREACH_NLRI.  If we call each of these a "collection of NLRI",
> then there may be between 1 and 4 collections of NLRI in a message
> (ignoring the End-of-RIB pseudo-UPDATE).  It is an error to have more
> than one MP_REACH_NLRI attribute or more than one MP_UNREACH_NLRI
> attribute.  It is not clear to me whether those should be Critical
> Errors -- but while we are relaxing all error checking, I don't see
> why they should be.
>
> It is possible that most implementations actually only send one
> collection per message... so, as you suggest, on a you-know-and-I-know
> basis, we all could ignore the "theoretical" issues of more than one
> collection in a message.
>
> But, this does not matter if the requirements allow lost NLRI to
> simply be ignored.  If the sender sends any MP_XXX first, then there
> is little chance that NLRI will be lost, and things work better.  If
> the sender follows the old rules, and particularly if it sends more
> than one collection at a time, then there is more chance that NLRI
> will be lost.
>
> > > Later:
> > >
> > >  (2) "All errors whereby the contained NLRI can be
> > >       extracted are referred to as Non-Critical".
> > >
> > > And that includes:
> > >
> > >  (3) "where the length of all path attributes contained
> > >       within the UPDATE does not correspond to the
> > >       total path attribute length."
> > >
> > > That is, at least, more explicit than
> > > draft-ietf-idr-error-handling-03, which glosses over (3).
> > >
> > > But if (3) is non-critical then there is some chance that
> > > some NLRI will not be extracted, which appears to violate
> > > (1) and (2).
>
> > Disclaimer: As I am sure that my comments previously have made
> > clear, I do not maintain a code base for a BGP daemon/implementation
> > - so please feel free to correct my logic below.
> >
> > I do not believe that (3) implies that the NLRI cannot be correctly
> > found.
>
> What (3) implies to me is that somewhere amongst the attributes
> something is broken.  As Jakob Heitz succinctly puts it: "Once you
> have a malformed update, NOTHING is certain."  In particular, you
> cannot be certain that all NLRI referred to by the sender can be found
> by the receiver.
>
> > If the sum of total length is incorrect, then we can still
> > extract the individual attributes - we just find that there is not
> > enough data to fill the overall length we were told and/or we have
> > too much attribute data compared to the total attribute length.
>
> This is precisely where I think the requirements come unstuck.  There
> is almost no redundancy in the attribute encoding.  And almost all the
> redundancy there is is discarded by the new-form-error-handling.  So,
> if the sum of the attribute lengths is incorrect, then (inter alia)
> the receiver CANNOT know whether the attributes it has extracted are
> the attributes the sender intended to send.  In particular, the
> receiver cannot know it has extracted all NLRI (except in the,
> probably obscure, case of having extracted both MP_REACH_NLRI and
> MP_UNREACH_NLRI).  "Once you have a malformed update, NOTHING is
> certain."
>
> Conversely, if the sum of attribute lengths is correct, then there is
> a fighting chance that the attributes received are the attributes
> sent.  But, if one of those attributes is (say) nominally an
> ATOMIC_AGGREGATE which is apparently 200 octets long, that might be a
> worry.
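
As a rough illustration of the point quoted above, the sketch below walks a
Path Attributes block using the RFC 4271 layout (flags octet, type code
octet, then a one-octet length, or two octets when the Extended Length flag
0x10 is set). It is a sketch under those assumptions, not code from any
implementation; `walk_path_attributes` is a hypothetical name, and the only
check it makes is whether the individual lengths add up to the block length.

```python
import struct

EXTENDED_LENGTH = 0x10  # attribute flag bit: length field is two octets


def walk_path_attributes(buf):
    """Walk a Path Attributes block.

    Returns (attrs, consistent): attrs is a list of (type_code, value)
    for every attribute that could be carved out, and consistent is True
    only if the attribute lengths exactly fill the block.
    """
    attrs = []
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < 3:          # not even flags/type/length left
            return attrs, False
        flags, type_code = buf[pos], buf[pos + 1]
        if flags & EXTENDED_LENGTH:
            if len(buf) - pos < 4:
                return attrs, False
            (length,) = struct.unpack_from("!H", buf, pos + 2)
            header = 4
        else:
            length = buf[pos + 2]
            header = 3
        end = pos + header + length
        if end > len(buf):              # attribute overruns the block
            return attrs, False
        attrs.append((type_code, bytes(buf[pos + header:end])))
        pos = end
    return attrs, True


origin = b"\x40\x01\x01\x00"            # ORIGIN (type 1), length 1
atomic_aggregate = b"\x40\x06\x00"      # ATOMIC_AGGREGATE (type 6), length 0
good = origin + atomic_aggregate
bad = b"\x40\x01\x02\x00" + atomic_aggregate   # ORIGIN length corrupted to 2
```

With `good`, both attributes come out and the lengths are consistent. With
`bad`, the corrupted ORIGIN swallows the first octet of the next attribute,
the walk reports an inconsistency, and the second attribute is never seen.
Had that second attribute been an MP_UNREACH_NLRI, its NLRI would be "lost"
in exactly the sense discussed here.
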
>
> ...
> > > Then (4) "In order to maximise the number of cases whereby the
> > > NLRI attributes [plural, now, BTW] can be reliably extracted
> > > from a received message...".  Ah.  So it is not a Critical
> > > Error if "the NLRI attribute cannot be correctly extracted".
>
> > No - it is a Critical error if we cannot extract the NLRI. This
> > recommendation is to give an increased chance that the NLRI can be
> > extracted as per the IDR error handling draft. This then (by virtue
> > of resulting in the NLRI being extracted) minimises the number of
> > cases that result in a Critical error.
>
> By this do you mean "all" NLRI or only "some" NLRI ?  As above.
>
> > The plural here is to reflect
> > the existence of >1 type of NLRI attribute.
>
> That would imply that it should be "NLRI attributes" throughout ?
> Though in most places where the draft speaks of "NLRI attribute" I
> think it means one, some, any or all of the "collections of NLRI" (as
> defined above).
>
> > > For me the requirement remains "conflicted".  On the one hand it
> > > seems to say that it is a Critical Error if the NLRI cannot be
> > > extracted and parsed.  On the other it seems to say it's OK if
> > > you cannot extract some NLRI.
>
> > If you'll forgive me for removing a significant proportion of your
> > message, I think that we need to take another step back here. It
> > seems to me that the key question that you are highlighting is "What
> > level of confidence do we need to have before we declare that the
> > NLRI cannot be extracted?" -- do you agree?
>
> Absolutely.
>
> > From an operator perspective, I would like to compromise *certainty*
> > for *robustness*. You are right, we are compromising correctness
> > here, we might end up withdrawing an incorrect NLRI and impacting
> > service operation for that prefix - however, it is somewhat
> > preferable to me to withdraw a subset of the NLRI incorrectly,
> > rather than impact all NLRI in one single action. We clearly need to
> > provide some bounds on how much we compromise the certainty (and
> > live within the realms of possibility, such that we are not just
> > taking a shot in the dark). This is what the definitions of Critical
> > and Non-Critical within the document are intended to provide. Once
> > again I will refer to the requirement that there is a balance
> > between correctness and robustness - rather than a locally risk
> > averse approach that results in harmful wider behaviour.
>
> Where one can extract the NLRI from a broken message, then
> treat-as-withdraw is AFAICS no worse for the NLRI in question than
> session-reset, and hugely better for all the other NLRI received from
> the peer.
>
> Where not *all* the NLRI in a message have been extracted, then we
> take a step beyond withdrawing something which should not have been
> withdrawn.  For each of those NLRI the receiver will (unknowingly) do
> one of:
>
>   (a) continue using a route which the sender has
>       withdrawn;
>
>   (b) continue to use an out of date version of a route
>       which the sender has changed in some way, possibly
>       materially;
>
>   (c) fail to use a new route the sender has now made
>       available.
>
> Of these (a) looks serious and (b) could be, but certainly the effect
> is different from session-reset; while (c) is, essentially,
> treat-as-withdraw.  If some risk of "lost routes" is taken, then the
> impact of that clearly must be weighed against the impact of
> session-reset.  I have not found a discussion of the impact of "lost
> routes" in the draft, so I have no idea what the trade-off is, here.
>
> For completeness, if all forms and degrees of attribute broken-ness
> are required to be acceptable, then there are cases where the receiver
> cannot know whether *all* NLRI have been extracted, or not.
>
> > Is it acceptable that we leave this as guidance within the
> > requirements? If not, please could you suggest how the definitions
> > of Critical/Non-Critical could be altered to address your concerns?
>
> Without a change at the sender end, the problem is that: "Once you
> have a malformed update, NOTHING is certain."
>
> So, as I said earlier, the fundamental question is:
>
>   (F) is it OK to continue with a session after processing
>       a message which may have contained some NLRI which
>       could not be extracted ?
>
> As you say, the question hinges on the "which may have contained".  As
> above, it seems to me that the draft skates over the issue of "lost
> routes".
>
> It also hinges on the operational impact of such "lost routes", on
> which I am unqualified to pronounce.
>
> It seems to me there is a range of possibilities:
>
>   1) at one end of the spectrum, if "lost routes" are to
>      be avoided at all costs, then the rules for parsing
>      attributes need to be tight and without a change at
>      the sender end, what is achievable is constrained.
>
>   2) if "lost routes" are acceptable in the "theoretical"
>      cases, but you-know-and-I-know that most of the
>      time the case will not arise...
>
>      ...or "lost routes" are acceptable provided some
>      defined steps are taken to identify as much NLRI
>      as is reasonably possible...
>
>      ...then the requirements need to (a) make the case
>      and (b) describe the trade-off (so that designs
>      made to follow the requirements can be judged in
>      those terms).
>
>   3) at the other end of the spectrum, if "lost routes"
>      are always preferable to session-reset, then this
>      opens up an even looser approach to UPDATE message
>      handling, in which practically nothing need cause
>      a session-reset -- ie, practically nothing is a
>      Critical Error.
>
> And, as I have suggested in previous discussions, it is possible to
> consider case (2) in two parts:
>
>   a) where the sender is unchanged.
>
>      In which case the issue is the degree of attribute
>      broken-ness vs the likelihood of "lost routes".
>
>      If this is viewed as an intermediate step, then
>      perhaps the requirements could err on the side of
>      safety ?
>
>      Perhaps this can be addressed by a "knob" allowing
>      more or less strictness in the parsing of
>      attributes ?
>
>   b) where the sender is changed.
>
>      In which case the issue ought to be moot, because
>      the receiver will then be able to know that there
>      are no "lost routes".
>
> > I would also appreciate further input from IDR as to whether this is
> > sufficient requirement from GROW to allow a solution document to be
> > written?
>
> The new requirements specify only Critical (session-reset) and
> Non-Critical (treat-as-withdraw) Errors.  This appears to rule out the
> "attribute discard" option in draft-ietf-idr-error-handling-03.  Is
> that intended ?
>
> Chris
>
> _______________________________________________
> Idr mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/idr
>

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
