Re: [GROW] [Idr] I-D Action: draft-ietf-grow-ops-reqs-for-bgp-error-handling-06.txt

Chris Hall Wed, 02 Jan 2013 08:02:29 -0800

Jeff Wheeler wrote (on Mon 31-Dec-2012 at 20:36 +0000):
....
> Like I keep saying, the goal of error-handling should be to make
> session-reset avoidable.  Today it is not.  It is creating complex
> rules which must be supported by both sides of the BGP session to
> try very hard to keep the network in a good state.


It is a truth universally acknowledged (AFAICS), that if NLRI in a
broken UPDATE are treated-as-withdraw, that is no worse than
session-reset and much to be preferred.

So, treat-as-withdraw is a reasonable thing for any implementation to
do, by default, where it can.

The problem is that when things are broken, it may not be possible to
identify all the NLRI -- some may be "lost".  [At this point I
recommend: "The Engineer", AA Milne.]

The effects of "lost NLRI" range from "ho hum" to "arghh", depending.
The possibility of "lost NLRI" ranges from "small, I think" through
"dunno" to "don't care".

The incidence of "lost NLRI" can be reduced by some simple rule
changes on the ordering of attributes.

But, there is a desire to improve things without changes at the sender
end.  Also, there is a desire to avoid session-reset at (almost) all
costs -- right up to the point where sessions are never reset
(re-syncing on 16 x 0xFF as required).  So, since we cannot eliminate
"lost NLRI" let us learn to love them (or at least learn some
tolerance).
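
As an illustration of what "re-syncing on 16 x 0xFF" could look like, here is a minimal sketch. It assumes the receiver holds the raw byte stream in a buffer and scans forward for the next plausible message header: 16 octets of 0xFF followed by a length between 19 and 4096 (the RFC 4271 limits). The function name `resync` is hypothetical. This is a heuristic only, since 16 consecutive 0xFF octets can also occur inside a message body.

```python
import struct

MARKER = b"\xff" * 16   # BGP header marker: 16 octets of all ones
HEADER_LEN = 19         # marker (16) + length (2) + type (1)
MAX_MSG_LEN = 4096      # maximum message length in RFC 4271


def resync(buf, start=0):
    """Return the offset of the next plausible BGP header, or None.

    Hypothetical helper: scans for the marker and accepts a candidate
    only if the length field that follows it is in the legal range.
    None means no candidate yet (for example, more bytes are needed).
    """
    pos = start
    while True:
        pos = buf.find(MARKER, pos)
        if pos < 0 or pos + HEADER_LEN > len(buf):
            return None
        (length,) = struct.unpack_from("!H", buf, pos + 16)
        if HEADER_LEN <= length <= MAX_MSG_LEN:
            return pos
        pos += 1  # implausible length: keep scanning from the next octet
```

A candidate found this way is still only a guess; whatever lay between the broken message and the new header is exactly where "lost NLRI" may be hiding.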

The risks posed by "lost NLRI" are unknown, and the effects context
dependent.  Given the uncertainty, I do not think it is reasonable to
accept anything more than a (vanishingly) small risk of "lost NLRI",
by default.  So, AFAICS, we are in the land of the knob.

Suppose three categories of error in an UPDATE message:

  * Non-Critical -- ie treat-as-withdraw or otherwise

    in general terms: all NLRI accounted for.

    More accurately: the risk of "lost NLRI" is deemed
    negligible.

  * Critical -- ie bad, but not session-reset

    in general terms: "lost NLRI", ie:

      either: there is reason to believe that there
        are "lost NLRI" (eg. a broken MP_UNREACH_NLRI
        attribute)

      or: it is not sufficiently clear that all NLRI
        are accounted for.

    NLRI that can be accounted for may be treated in
    a Non-Critical sort of a way, but some other
    response(s) may be triggered to mitigate the
    effect of "lost NLRI".

  * Fatal -- ie session-reset

    things are FUBAR -- by some definition.

    (Graceful Restart and enhancements thereof may blur
     the line between Fatal and Critical errors.  But
     that's another story.)

The message is broken... things are uncertain, so: a key knob is the
one which allows the operator to select criteria for Criticality --
anything less is Non-Critical, anything more is Fatal.  Other knobs
may select for specific responses to different forms of Critical and
Non-Critical errors.
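
One way to picture the Criticality knob is sketched below. It assumes each detected error can be assigned a rough "lost NLRI" risk level, and that the operator sets two thresholds on that scale. All names (`LostNlriRisk`, `CriticalityKnob`, `classify`) and the risk levels themselves are hypothetical and chosen for illustration. The default leaves the Critical band empty, so anything beyond negligible risk is Fatal unless the operator widens it.

```python
from dataclasses import dataclass
from enum import IntEnum


class LostNlriRisk(IntEnum):
    """Hypothetical ordered scale for the risk of "lost NLRI"."""
    NEGLIGIBLE = 0      # all NLRI accounted for
    UNCLEAR = 1         # not sufficiently clear that all are accounted for
    LIKELY = 2          # reason to believe some are lost
    FRAMING_BROKEN = 3  # message cannot be delimited at all


class Category(IntEnum):
    NON_CRITICAL = 0
    CRITICAL = 1
    FATAL = 2


@dataclass(frozen=True)
class CriticalityKnob:
    """Operator-selected band: below it is Non-Critical, above is Fatal."""
    critical_from: LostNlriRisk = LostNlriRisk.UNCLEAR
    fatal_from: LostNlriRisk = LostNlriRisk.UNCLEAR  # empty band by default

    def classify(self, risk):
        if risk >= self.fatal_from:
            return Category.FATAL
        if risk >= self.critical_from:
            return Category.CRITICAL
        return Category.NON_CRITICAL


# An operator willing to tolerate some "lost NLRI" widens the band.
relaxed = CriticalityKnob(fatal_from=LostNlriRisk.FRAMING_BROKEN)
```

With the default knob, `classify(LostNlriRisk.UNCLEAR)` gives `Category.FATAL`; with `relaxed` the same error is `Category.CRITICAL`.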

This is all starting to look complicated :-(  But at least it avoids
trying to square the circle.  [When you have eliminated the
impossible, what remains, ....]

At a practical level, I observe:

  1) if (or as) most implementations only send one
     collection of NLRI per message, then extracting
     one collection is enough to be reasonably sure
     there are no "lost NLRI".

  2) the worst case of "lost NLRI" is lost Withdrawn
     NLRI... which, per (1), are going to be the
     only NLRI in the message, and hence unlikely
     to be lost.
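
Observation (1) can be sketched as a check, under stated assumptions. The code below splits an UPDATE body (the octets after the 19-octet header) into its three RFC 4271 fields and treats "at least one non-empty NLRI collection was extracted" as good enough. The function names are hypothetical, and the sketch ignores MP_REACH_NLRI and MP_UNREACH_NLRI, which live inside the path attributes and would need attribute parsing to count.

```python
import struct


def split_update_body(body):
    """Split an UPDATE body into (withdrawn, attrs, nlri) byte strings.

    Raises ValueError if the two length fields do not fit the body.
    """
    if len(body) < 4:
        raise ValueError("UPDATE body too short")
    (wlen,) = struct.unpack_from("!H", body, 0)
    alen_off = 2 + wlen
    if alen_off + 2 > len(body):
        raise ValueError("Withdrawn Routes Length overruns message")
    (alen,) = struct.unpack_from("!H", body, alen_off)
    attrs_off = alen_off + 2
    if attrs_off + alen > len(body):
        raise ValueError("Total Path Attribute Length overruns message")
    withdrawn = body[2:alen_off]
    attrs = body[attrs_off:attrs_off + alen]
    nlri = body[attrs_off + alen:]
    return withdrawn, attrs, nlri


def probably_no_lost_nlri(body):
    """Heuristic from observation (1): one extracted collection suffices.

    Assumes the sender puts only one collection of NLRI in a message.
    """
    try:
        withdrawn, _attrs, nlri = split_update_body(body)
    except ValueError:
        return False
    return bool(withdrawn) or bool(nlri)
```

Whether such a heuristic is acceptable is precisely the kind of thing a knob would let the operator decide.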

So, a knob or two can allow the operator to settle on what works best
for them, and allow the implementer to provide stuff which an adult
operator may use in the privacy of their own network.  And, one can
envisage Emergency Knobs -- for when there is limited time to Save the
World, and being fastidious about the specification just gets in the
way.

Stepping back from the minutiae of unpacking UPDATE messages, in
routeing terms the above categories are (I think):

  * Non-Critical:

      - some routes which the sender has offered have been
        filtered out, because the attributes were garbled,
        (ie treat-as-withdraw),

      - some routes which the sender offered had partly
        invalid attributes, but they have been accepted,
        in some form (ie other mechanisms),

      - BUT the routes which remain are *valid*,

      - AND the receiver knows which prefixes have been
        affected.

  * Critical:

      - some routes which the sender has offered are not
        available to the receiver,

      - some routes which should have been withdrawn
        may still be in use,

      - some routes whose attributes should have been
        changed may still be in use,

      - AND the receiver does not know which of the
        routes which remain are in doubt.

  * Fatal -- FUBAR

Incidentally, in the absence of any better, novel mechanism: having
reset and restarted a session, there is (much) less reason to worry
about "lost NLRI" -- at least until End-of-RIB rolls up (or some
time-out).  But in any case, some things which are Fatal in normal
running could be downgraded, for some period, after a session-reset?
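
A minimal sketch of that downgrade, assuming the receiver tracks the time since the last session-reset and whether End-of-RIB has been seen. The function name, the `grace_seconds` parameter and its default value are all hypothetical; the time-out is something an operator would choose.

```python
from enum import IntEnum


class Category(IntEnum):
    NON_CRITICAL = 0
    CRITICAL = 1
    FATAL = 2


def effective_category(category, seconds_since_reset, end_of_rib_seen,
                       grace_seconds=120.0):
    """Downgrade Fatal to Critical shortly after a session-reset.

    The window closes when End-of-RIB arrives or the (assumed) time-out
    expires, whichever comes first.  Pass seconds_since_reset=None if
    the session has never been reset.
    """
    in_window = (
        seconds_since_reset is not None
        and not end_of_rib_seen
        and seconds_since_reset < grace_seconds
    )
    if category is Category.FATAL and in_window:
        return Category.CRITICAL
    return category
```

Which Fatal errors are safe to downgrade this way is left open here, as it is in the text above.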

Happy New Year,

Chris

_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
