Re: [GROW] [Idr] Fwd: I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Jakob Heitz Fri, 28 Dec 2012 13:35:18 -0800

Treat-as-withdraw is not a cure.
Tear down the session is not a cure.

All we can hope for after a malformed update is a
temporary mitigation until human intervention can
fix the problem.

IMO, the goal of error handling is to limit the
damage, not to cure the problem.

--
Jakob Heitz.

________________________________
From: [email protected] [mailto:[email protected]] On Behalf Of Brian Dickson
Sent: Friday, December 28, 2012 11:02 AM
To: Chris Hall
Cc: [email protected]; [email protected]
Subject: Re: [GROW] [Idr] Fwd: I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Here's a quick question, about possible mechanisms for determining whether or 
not we need to tear down a session.

If there is a chance that some NLRI weren't properly decoded:

- What about requesting (presuming the option was negotiated) a route-refresh, 
or a "confirm per-AFI-SAFI prefix list"?

It's an expensive "parity check" but is one way of ensuring that we haven't 
missed a withdrawn prefix.
If it isn't in the subsequently received list of prefixes, we missed it and 
should withdraw it.
Everything else is a no-op - maybe we missed a new prefix (with new 
attribute?), but that is what "treat as withdraw" gets you.

Just trying to make sure the cure isn't worse than the disease, in all 
situations.
(Where the "cure" is either failing to withdraw because we "lost" the
withdrawal in a malformed update, or tearing down the session.)
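
As a rough illustration of the parity check described above: this is only
a sketch, and it assumes the receiver can tell when the refreshed list is
complete. The names `reconcile_after_refresh`, `adj_rib_in` and
`refreshed` are hypothetical, not taken from any implementation.

```python
def reconcile_after_refresh(adj_rib_in, refreshed):
    """Compare the prefixes we hold for a peer with the refreshed list.

    Both arguments are iterables of prefix strings (hypothetical
    representation). Returns (stale, new):
      stale - held but absent from the refresh: a withdrawal we missed,
              so these should be withdrawn now
      new   - present in the refresh but not held: an announcement we
              missed (or one we treated as withdraw)
    """
    held = set(adj_rib_in)
    seen = set(refreshed)
    return held - seen, seen - held


held = {"10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"}
refreshed = {"10.0.0.0/24", "10.0.2.0/24", "10.0.3.0/24"}
stale, new = reconcile_after_refresh(held, refreshed)
```

Anything in `stale` is a withdrawal we lost; anything in `new` is the
no-op case, picked up from the refresh itself.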

Brian

On Fri, Dec 28, 2012 at 1:04 PM, Chris Hall <[email protected]> wrote:
Rob Shakir wrote (on Fri 28-Dec-2012 at 13:10 +0000):
>
> (re: CCing IDR & GROW)
>
> On 28 Dec 2012, at 12:28, Chris Hall wrote:
>
> > Rob Shakir wrote (on Thu 27-Dec-2012 at 18:44):
> >>
> >> Any comments very welcome (to me or grow@).

> > I'm afraid I still don't get it :-(  What am I missing ?
> >
> > UPDATE Message Length errors are Critical because they:
> >
> > (1) "result in cases whereby the NLRI attribute cannot
> >      be correctly extracted".
> >
> > The implication is that a failure to extract all NLRI is Critical.
> > Is that a requirement ?

> If the NLRI cannot be determined, then this is a Critical error,
> yes. I left the wording relatively open on whether this is *all*
> NLRI, ...

OK.  So, if the message is broken (in any way):

  * if no NLRI can be found, that's a Critical Error.

    That would include the case where some broken
    attribute upstream of any MP_XXX obscures that
    MP_XXX.

  * if some NLRI can be found, that's a Non-Critical error.

    Such NLRI as can be found are treated-as-withdraw.

    AND any NLRI that may or may not have been lost, are
    ignored.

    AND whatever dangers lost NLRI may pose are covered
    by the caveats in 4.1.

Yes ?
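
To make that reading concrete, here is a minimal sketch of the rule as
summarised above. It is an interpretation of the discussion, not text
from the draft; `Handling` and `classify_update` are hypothetical names.

```python
from enum import Enum


class Handling(Enum):
    NORMAL = "process normally"
    NON_CRITICAL = "treat-as-withdraw"
    CRITICAL = "session-reset"


def classify_update(is_malformed, extracted_nlri):
    """Classify an UPDATE under the reading summarised above.

    is_malformed   - True if the message is broken in any way
    extracted_nlri - whatever NLRI the parser managed to find
    """
    if not is_malformed:
        return Handling.NORMAL
    if not extracted_nlri:
        # No NLRI could be found at all: Critical Error.
        return Handling.CRITICAL
    # Some NLRI were found: treat those as withdrawn. Any NLRI that
    # were lost are silently ignored, which is the open question here.
    return Handling.NON_CRITICAL
```

Note that the function has no input describing NLRI that may have been
lost, because the receiver has no way to know about them.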

> ... as I am not sure that in the requirements draft we should
> specify direct solutions to specific issues, to e.g., say how to
> handle cases where MP_REACH_NLRI and MP_UNREACH_NLRI are in the same
> message [this is a case that I do not believe is forbidden by
> rfc2858 - if the working group could clarify whether this is
> something that we feel the draft needs to handle or can explicitly
> be omitted, then that would be appreciated].

The RFCs also allow any mix of IPv4 Unicast Reachable/Unreachable in
the body of the message, with or without MP_REACH_NLRI and/or
MP_UNREACH_NLRI.  If we call each of these a "collection of NLRI",
then there may be between 1 and 4 collections of NLRI in a message
(ignoring the End-of-RIB pseudo-UPDATE).  It is an error to have more
than one MP_REACH_NLRI attribute or more than one MP_UNREACH_NLRI
attribute.  It is not clear to me whether those should be Critical
Errors -- but while we are relaxing all error checking, I don't see
why they should be.
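
A small sketch of what I mean by counting "collections of NLRI". It
assumes an already-parsed UPDATE, passed in as two lists of prefixes from
the message body plus the list of attribute type codes seen; that
representation and the name `count_collections` are hypothetical. The
type codes 14 and 15 are the assigned values for MP_REACH_NLRI and
MP_UNREACH_NLRI.

```python
MP_REACH_NLRI = 14
MP_UNREACH_NLRI = 15


def count_collections(withdrawn, nlri, attr_type_codes):
    """Return (collections, duplicate_mp) for a parsed UPDATE.

    withdrawn       - IPv4 unicast withdrawn routes from the body
    nlri            - IPv4 unicast reachable NLRI from the body
    attr_type_codes - type code of every path attribute, in order
    """
    mp_reach = attr_type_codes.count(MP_REACH_NLRI)
    mp_unreach = attr_type_codes.count(MP_UNREACH_NLRI)
    # More than one of either MP attribute is an error.
    duplicate_mp = mp_reach > 1 or mp_unreach > 1
    collections = sum(
        [bool(withdrawn), bool(nlri), mp_reach > 0, mp_unreach > 0]
    )
    return collections, duplicate_mp
```

For example, a message with withdrawn routes, body NLRI and both MP
attributes yields four collections, the maximum.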

It is possible that most implementations actually only send one
collection per message... so, as you suggest, on a you-know-and-I-know
basis, we all could ignore the "theoretical" issues of more than one
collection in a message.

But, this does not matter if the requirements allow lost NLRI to
simply be ignored.  If the sender sends any MP_XXX first, then there
is little chance that NLRI will be lost, and things work better.  If
the sender follows the old rules, and particularly if it sends more
than one collection at a time, then there is more chance that NLRI
will be lost.

> > Later:
> >
> >  (2) "All errors whereby the contained NLRI can be
> >       extracted are referred to as Non-Critical".
> >
> > And that includes:
> >
> >  (3) "where the length of all path attributes contained
> >       within the UPDATE does not correspond to the
> >       total path attribute length."
> >
> > That is, at least, more explicit than
> > draft-ietf-idr-error-handling-03, which glosses over (3).
> >
> > But if (3) is non-critical then there is some chance that
> > some NLRI will not be extracted, which appears to violate
> > (1) and (2).

> Disclaimer: As I am sure that my comments previously have made
> clear, I do not maintain a code base for a BGP daemon/implementation
> - so please feel free to correct my logic below.
>
> I do not believe that (3) implies that the NLRI cannot be correctly
> found.

What (3) implies to me is that somewhere amongst the attributes
something is broken.  As Jakob Heitz succinctly puts it: "Once you
have a malformed update, NOTHING is certain."  In particular, you
cannot be certain that all NLRI referred to by the sender can be found
by the receiver.

> If the sum of total length is incorrect, then we can still
> extract the individual attributes - we just find that there is not
> enough data to fill the overall length we were told and/or we have
> too much attribute data compared to the total attribute length.

This is precisely where I think the requirements come unstuck.  There
is almost no redundancy in the attribute encoding.  And almost all the
redundancy there is is discarded by the new-form-error-handling.  So,
if the sum of the attribute lengths is incorrect, then (inter alia)
the receiver CANNOT know whether the attributes it has extracted are
the attributes the sender intended to send.  In particular, the
receiver cannot know it has extracted all NLRI (except in the,
probably obscure, case of having extracted both MP_REACH_NLRI and
MP_UNREACH_NLRI).  "Once you have a malformed update, NOTHING is
certain."
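
A sketch of the problem, using the standard attribute header layout
(flags octet, type code octet, then a one-octet length, or two octets if
the Extended Length flag 0x10 is set). The function `walk_attributes` and
the sample bytes are made up for illustration.

```python
def walk_attributes(block):
    """Walk a path attribute block; return (attrs, consistent).

    attrs      - list of (type_code, value) found before any overrun
    consistent - True only if the attributes exactly fill the block
    """
    attrs, pos = [], 0
    while pos < len(block):
        if pos + 3 > len(block):
            return attrs, False
        flags, type_code = block[pos], block[pos + 1]
        if flags & 0x10:  # Extended Length: two-octet length field
            if pos + 4 > len(block):
                return attrs, False
            length, hdr = int.from_bytes(block[pos + 2:pos + 4], "big"), 4
        else:
            length, hdr = block[pos + 2], 3
        end = pos + hdr + length
        if end > len(block):
            return attrs, False
        attrs.append((type_code, block[pos + hdr:end]))
        pos = end
    return attrs, True


# ORIGIN (type 1, length 1) followed by MP_UNREACH_NLRI (type 15)
# withdrawing 10.0.0.0/24 for AFI 1 / SAFI 1.
origin = bytes([0x40, 0x01, 0x01, 0x00])
mp_unreach = bytes([0x80, 0x0F, 0x07, 0x00, 0x01, 0x01, 0x18, 0x0A, 0x00, 0x00])
good_attrs, good_ok = walk_attributes(origin + mp_unreach)

# Same bytes, but the ORIGIN length octet is corrupted from 1 to 5.
broken = bytes([0x40, 0x01, 0x05, 0x00]) + mp_unreach
bad_attrs, bad_ok = walk_attributes(broken)
```

In the corrupted case the walk reports an inconsistency, but the only
attribute it recovers is an oversized ORIGIN that has swallowed the start
of the MP_UNREACH_NLRI. The receiver knows something is wrong and cannot
see the withdrawal.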

Conversely, if the sum of attribute lengths is correct, then there is
a fighting chance that the attributes received are the attributes
sent.  But, if one of those attributes is (say) nominally an
ATOMIC_AGGREGATE which is apparently 200 octets long, that might be a
worry.

...
> > Then (4) "In order to maximise the number of cases whereby the
> > NLRI attributes [plural, now, BTW] can be reliably extracted
> > from a received message...".  Ah.  So it is not a Critical
> > Error if "the NLRI attribute cannot be correctly extracted".

> No - it is a Critical error if we cannot extract the NLRI. This
> recommendation is to give an increased chance that the NLRI can be
> extracted as per the IDR error handling draft. This then (by virtue
> of resulting in the NLRI being extracted) minimises the number of
> cases that result in a Critical error.

By this do you mean "all" NLRI or only "some" NLRI ?  As above.

> The plural here is to reflect
> the existence of >1 type of NLRI attribute.

That would imply that it should be "NLRI attributes" throughout ?
Though in most places where the draft speaks of "NLRI attribute" I
think it means one, some, any or all of the "collections of NLRI" (as
defined above).

> > For me the requirement remains "conflicted".  On the one hand it
> > seems to say that it is a Critical Error if the NLRI cannot be
> > extracted and parsed.  On the other it seems to say it's OK if
> > you cannot extract some NLRI.

> If you'll forgive me for removing a significant proportion of your
> message, I think that we need to take another step back here. It
> seems to me that the key question that you are highlighting is "What
> level of confidence do we need to have before we declare that the
> NLRI cannot be extracted?" -- do you agree?

Absolutely.

> From an operator perspective, I would like to compromise *certainty*
> for *robustness*. You are right, we are compromising correctness
> here, we might end up withdrawing an incorrect NLRI and impacting
> service operation for that prefix - however, it is somewhat
> preferable to me to withdraw a subset of the NLRI incorrectly,
> rather than impact all NLRI in one single action. We clearly need to
> provide some bounds on how much we compromise the certainty (and
> live within the realms of possibility, such that we are not just
> taking a shot in the dark). This is what the definitions of Critical
> and Non-Critical within the document are intended to provide. Once
> again I will refer to the requirement that there is a balance
> between correctness and robustness - rather than a locally risk
> averse approach that results in harmful wider behaviour.

Where one can extract the NLRI from a broken message, then
treat-as-withdraw is AFAICS no worse for the NLRI in question than
session-reset, and hugely better for all the other NLRI received from
the peer.

Where not *all* the NLRI in a message have been extracted, then we
take a step beyond withdrawing something which should not have been
withdrawn.  For each of those NLRI the receiver will (unknowingly) do
one of:

  (a) continue using a route which the sender has
      withdrawn;

  (b) continue to use an out of date version of a route
      which the sender has changed in some way, possibly
      materially;

  (c) fail to use a new route the sender has now made
      available.

Of these (a) looks serious and (b) could be, but certainly the effect
is different from session-reset; while (c) is, essentially,
treat-as-withdraw.  If some risk of "lost routes" is taken, then the
impact of that clearly must be weighed against the impact of
session-reset.  I have not found a discussion of the impact of "lost
routes" in the draft, so I have no idea what the trade-off is, here.
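
The three cases can be written down as a small table. This is only a
sketch of the reasoning above; `lost_route_effect` and its arguments are
hypothetical, and of course a real receiver cannot call it, since it does
not know what the sender intended.

```python
def lost_route_effect(receiver_has_route, sender_action):
    """Effect on the receiver of losing one NLRI from a broken UPDATE.

    receiver_has_route - whether the receiver already holds a route
                         for the prefix
    sender_action      - "withdraw" or "announce": what the sender
                         intended for that prefix
    """
    if sender_action == "withdraw":
        if receiver_has_route:
            return "(a) continues using a withdrawn route"
        return "no effect"
    if sender_action == "announce":
        if receiver_has_route:
            return "(b) continues using an out-of-date route"
        return "(c) fails to use a new route"
    raise ValueError("unknown sender_action: %r" % (sender_action,))
```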

For completeness, if all forms and degrees of attribute broken-ness
are required to be acceptable, then there are cases where the receiver
cannot know whether *all* NLRI have been extracted, or not.

> Is it acceptable that we leave this as guidance within the
> requirements? If not, please could you suggest how the definitions
> of Critical/Non-Critical could be altered to address your concerns?

Without a change at the sender end, the problem is that: "Once you
have a malformed update, NOTHING is certain."

So, as I said earlier, the fundamental question is:

  (F) is it OK to continue with a session after processing
      a message which may have contained some NLRI which
      could not be extracted ?

As you say, the question hinges on the "which may have contained".  As
above, it seems to me that the draft skates over the issue of "lost
routes".

It also hinges on the operational impact of such "lost routes", on
which I am unqualified to pronounce.

It seems to me there is a range of possibilities:

  1) at one end of the spectrum, if "lost routes" are to
     be avoided at all costs, then the rules for parsing
     attributes need to be tight and without a change at
     the sender end, what is achievable is constrained.

  2) if "lost routes" are acceptable in the "theoretical"
     cases, but you-know-and-I-know that most of the
     time the case will not arise...

     ...or "lost routes" are acceptable provided some
     defined steps are taken to identify as much NLRI
     as is reasonably possible...

     ...then the requirements need to (a) make the case
     and (b) describe the trade-off (so that designs
     made to follow the requirements can be judged in
     those terms).

  3) at the other end of the spectrum, if "lost routes"
     are always preferable to session-reset, then this
     opens up an even looser approach to UPDATE message
     handling, in which practically nothing need cause
     a session-reset -- ie, practically nothing is a
     Critical Error.

And, as I have suggested in previous discussions, it is possible to
consider case (2) in two parts:

  a) where the sender is unchanged.

     In which case the issue is the degree of attribute
     broken-ness vs the likelihood of "lost routes".

     If this is viewed as an intermediate step, then
     perhaps the requirements could err on the side of
     safety ?

     Perhaps this can be addressed by a "knob" allowing
     more or less strictness in the parsing of
     attributes ?

  b) where the sender is changed.

     In which case the issue ought to be moot, because
     the receiver will then be able to know that there
     are no "lost routes".

> I would also appreciate further input from IDR as to whether this is
> sufficient requirement from GROW to allow a solution document to be
> written?

The new requirements specify only Critical (session-reset) and
Non-Critical (treat-as-withdraw) Errors.  This appears to rule out the
"attribute discard" option in draft-ietf-idr-error-handling-03.  Is
that intended ?

Chris