Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Chris Hall Wed, 02 Jan 2013 12:32:14 -0800

Rob Shakir wrote (on Wed 02-Jan-2013 at 11:39 +0000):
> On 1 Jan 2013, at 17:27, Chris Hall wrote:
> > [snip]
 
> I think this is a good summary of the different approaches. In ops-
> reqs-for-bgp-error-handling-06, there is no category of "fatal"
> essentially because (as you highlight) the line between the fatal
> and the critical cases is somewhat blurry. I would propose that we
> do not add another category of error for "fatal".


OK.  Old-fashioned session-reset (unadorned by Graceful Restart or
other mitigation, past, present or future) is Bad.  The Fatal category
would capture cases where an old-fashioned session-reset is the only
available response.  As old-fashioned session-reset fades into distant
memory, that distinction will be less useful.

I think that what this boils down to is:

  Non-Critical => per-Message response is sufficient

  Critical     => per-Session (or possibly per-AFI/SAFI)
                  response is required.

The anatomy of per-Message response may include:

  1) treat-as-withdraw all NLRI that can be identified

  2) scrubbing round some attribute errors and proceeding
     with some or all announced NLRI.

  3) other (novel) means to recover the state of particular
     NLRI.

  4) (implicitly) ignoring any known or unknown "lost-NLRI".

The anatomy of per-Session response may include:

  a) old-fashioned session-reset, where nothing else is
     available.

  b) complete session-drop, where patience is exhausted.

  c) other means, old and new, to restore the session,
     to some level of health.

  d) some means to control cycles of repeated errors
     generating the same (inadequate) response.

     [Definition of madness: doing the same thing over
      and over again in the expectation of a different
      outcome.] 
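
To illustrate (d), here is a minimal sketch of one way to stop a
session repeating the same inadequate response.  The ladder, its
ordering, the names and the max_repeats knob are all my own
assumptions for illustration; nothing here comes from the draft or
from any implementation.

```python
from collections import Counter

# Hypothetical escalation ladder, mildest first. The ordering is an
# assumption: (c) other means to restore the session, then (a)
# old-fashioned session-reset, then (b) complete session-drop.
LADDER = ["restore-session", "session-reset", "session-drop"]


class Escalator:
    """Pick a per-Session response, escalating when an error repeats."""

    def __init__(self, max_repeats=2):
        # How many times the same response is tried before escalating.
        self.max_repeats = max_repeats
        self.seen = Counter()

    def next_response(self, error_kind):
        # Count this occurrence, then move one rung up the ladder for
        # every max_repeats occurrences of the same kind of error.
        self.seen[error_kind] += 1
        step = (self.seen[error_kind] - 1) // self.max_repeats
        return LADDER[min(step, len(LADDER) - 1)]


escalator = Escalator(max_repeats=2)
responses = [escalator.next_response("bad-update") for _ in range(5)]
```

A real implementation would presumably also age out old counts, so
that a session which has been healthy for a long time starts again
from the bottom of the ladder.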

It seems to me that there are two contexts for error-handling: (i) in
normal running, and (ii) during error-recovery.

Things which are deemed Critical in normal running may be deemed
Non-Critical during error-recovery.  That may be part of the automatic
per-Session response to an error, or may be some override settable by
the Operator.

What an operator deems Non-Critical in normal running will depend on
their assessment of the risk/impact of "lost NLRI" and the impact/cost
of the available per-Session response(s).  The risk of "lost NLRI"
depends on sender behaviour and on the acceptable
"degree-of-broken-ness".  The impact of "lost NLRI" is context
dependent.  The impact/cost of any per-Session response depends on
what is supported by both ends of the session.

My conclusion is that there is no "one size fits all" allocation of
errors to classes of error.  Hence, the requirement is for operator
control over error classification and over error response/recovery --
on a per-Session basis -- both for normal running and error-recovery.
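
As a rough sketch of what that operator control might look like: a
per-Session table mapping (context, kind of error) to a level of
recovery, with a cautious default.  The class names and the error
kinds ("attribute-flags", "attribute-length") are hypothetical,
chosen only for illustration; they are not taken from the draft.

```python
from dataclasses import dataclass, field
from enum import Enum


class Recovery(Enum):
    MESSAGE_LEVEL = "message"   # e.g. treat-as-withdraw
    SESSION_LEVEL = "session"   # e.g. session-reset or other means


class Context(Enum):
    NORMAL = "normal"           # (i) normal running
    RECOVERING = "recovering"   # (ii) during error-recovery


@dataclass
class SessionPolicy:
    # Operator-set table for one session:
    # (context, error kind) -> level of recovery.
    overrides: dict = field(default_factory=dict)
    # Cautious default: anything not listed gets a per-Session response.
    default: Recovery = Recovery.SESSION_LEVEL

    def classify(self, context, error_kind):
        return self.overrides.get((context, error_kind), self.default)


# The same kind of error may be handled differently in each context.
policy = SessionPolicy(overrides={
    (Context.NORMAL, "attribute-flags"): Recovery.MESSAGE_LEVEL,
    (Context.RECOVERING, "attribute-flags"): Recovery.MESSAGE_LEVEL,
    (Context.RECOVERING, "attribute-length"): Recovery.MESSAGE_LEVEL,
})

normal = policy.classify(Context.NORMAL, "attribute-length")
recovering = policy.classify(Context.RECOVERING, "attribute-length")
```

Here a length error is session-level in normal running but
message-level during error-recovery, which is the "Critical in normal
running, Non-Critical during error-recovery" case.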

Having got this far... I'm tempted to back away from talking about
Critical/Non-Critical *Errors*, and talk, instead, about
Message-Level/Session-Level *Recovery*... not much of the existing
draft would be affected if the focus shifted from Errors to Recovery.

.....
> I would suggest that adding the following wording to § 3 of the
> draft addresses this, and clarifies the issue of "lost" NLRI:
> 
> "An error SHOULD be defined as Non-Critical if at least one NLRI
> attribute within an erroneous message can be successfully parsed. In
> cases where more than one attribute containing NLRI is included
> within a single UPDATE message, this may result in cases where some
> NLRI contained within subsequent attributes are missed, particularly
> where length errors exist in the message. In order to minimise the
> risk of such occurrences, it is recommended that an implementation
> SHOULD include only one attribute containing NLRI per message."

The way in which attributes and NLRI are packed in a (current) UPDATE
message is unhelpful.  But, I don't think this is the place to solve
that problem.

Further, changes to the specification of UPDATE messages cannot ensure
that software will follow that (or any other) specification.  After
all, we are only here because software is less than perfect !  And,
there is some desire to cope with errors which cannot be addressed by
any amount of specification-tweaking.

And, there is the need to do better without changes at the sender end.

Hence, IMHO the question of what should be handled at the
Message-Level, and what should be handled at the Session-Level, is
best decided by the operator... as above.

Further, this approach would allow the requirements to avoid the
quicksand which is UPDATE message parsing.  [Huzzah !]

-----

The requirements could recommend new defaults for the classification
of errors.  There's no absolute need for this: given the ability to
choose something which suits them better, I'm sure operators will
happily enable what they want.

I guess any *default* would err on the side of caution, i.e.
Message-Level Recovery is appropriate only where there is a
(vanishingly) small risk of "lost NLRI".

The appropriate default for a given session may depend on the
behaviour of the peer, which may be the subject of negotiation,
configuration or sweeping generalisation.

The requirements could recommend that UPDATE messages should be
altered to reduce the risk of "lost NLRI" or be generally more robust.
If that is constrained by the need for new-form UPDATE messages to be
downwards compatible with existing ones, the requirements should
mention that.  If the default classification depends on new sender
behaviour, I think that implies the requirement for a C.... but I
won't repeat that heresy :-)

I'm skirting around the quicksand here; I don't want to get sucked
down again...

Chris 
