Re: [GROW] [Idr] draft-ietf-grow-ops-reqs-for-bgp-error-handling-05

John Leslie Fri, 31 Aug 2012 12:09:55 -0700

Chris Hall <[email protected]> wrote:
> 
> In trying to classify Errors with BGP-4 UPDATE Messages I think it
> would be useful to distinguish between the form of an error and the
> severity of that error and how BGP should respond.


   I welcome such a discussion!

> It seems to me that there are four severities/responses:
> 
>   1) "Critical Error" -> drop/restart session or AFI/SAFI
> 
>      So the overall response then depends on how gracefully the
>      session drop and/or restart can be handled.

   BGP was designed for a naive case where all your peers were either
dependable or broken. Moving beyond that has proven difficult. :^(

   But we should recognize the "fully broken" case, which amounts to
"drop the session and be in no hurry to try that peer again".

   Arguably, this should even require human intervention before trying
another session, but I have no horse in that race...

   It is indeed interesting to consider that one AFI/SAFI may be
"fully broken" while others seem dependable: perhaps we should consider
dropping only the part of a session concerning that AFI/SAFI...

>   2) "Serious Error" -> do something with NLRI, short of
>      dropping the session.

   This is a bit vague...

>      The "treat-as-withdraw" mechanism is mentioned.

   This is appropriate when we have reason to believe the peer is
dependable, but is passing us NLRI routing information we're not
willing to trust.  IMHO, "not willing to trust" is in the eye of the
beholder...

   The default should be: we won't send such blocks to that peer
because we don't trust what they'd do with it. However, we might
indeed want to send it to that peer if we have no other route to
that destination. Thus, it seems it should actually be a disincentive
(possibly a binary disincentive, but presumably usually closer to
a nasty localpref).
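
   A minimal sketch of that idea, purely as illustration: a route
learned through a suspect UPDATE stays in the table but is ranked
below trusted routes, so it is chosen only when nothing better
exists. The names Route, effective_local_pref, best_route and the
penalty parameter are hypothetical and come from no draft or
implementation; real best-path selection has many more steps.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Route:
    prefix: str
    peer: str
    local_pref: int
    suspect: bool = False  # learned from an UPDATE we do not fully trust


def effective_local_pref(route: Route, penalty: Optional[int] = None) -> int:
    """Preference used for ranking; suspect routes are pushed down."""
    if not route.suspect:
        return route.local_pref
    if penalty is None:
        # "Binary" disincentive: rank below every trusted route.
        return -1
    # "Nasty localpref": subtract a penalty, but never go below zero.
    return max(route.local_pref - penalty, 0)


def best_route(candidates: Iterable[Route],
               penalty: Optional[int] = None) -> Optional[Route]:
    """Pick the candidate with the highest effective preference."""
    return max(candidates,
               key=lambda r: effective_local_pref(r, penalty),
               default=None)
```

   With the binary form a suspect route is still used when it is the
only route to the destination, which is the behaviour that a plain
withdraw would lose.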

   In brief, I'm not sure "treat-as-withdraw" is the right name...

>      The requirements obviously do not wish to specify mechanisms.
> 
>      But I think that the requirements should address what outcome
>      is expected if errors in an individual UPDATE message are to
>      be limited to that message.  I think what that means is:
> 
>        * it must be possible to identify all NLRI that the message
>          could be carrying.

   I read this to mean that all length+prefix must be readable
(but not necessarily considered valid), whether or not the path
attributes make sense.
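
   As a rough sketch of what "readable" could mean for the plain
IPv4 NLRI field of RFC 4271 (one length octet in bits, followed by
just enough octets to hold the prefix): the function name
parse_prefixes and the max_bits parameter are assumptions made for
illustration, and NLRI carried inside MP_REACH_NLRI is not covered.

```python
def parse_prefixes(data: bytes, max_bits: int = 32):
    """Split an NLRI (or Withdrawn Routes) field into (bits, prefix_bytes).

    Raises ValueError if a prefix length is impossible or if the last
    prefix does not end exactly at the end of the field.
    """
    prefixes = []
    offset = 0
    while offset < len(data):
        bits = data[offset]
        offset += 1
        if bits > max_bits:
            raise ValueError(f"prefix length {bits} exceeds {max_bits}")
        nbytes = (bits + 7) // 8  # octets needed to hold 'bits' bits
        if offset + nbytes > len(data):
            raise ValueError("prefix runs past the end of the field")
        prefixes.append((bits, data[offset:offset + nbytes]))
        offset += nbytes
    return prefixes
```

   Note that this only establishes that every length+prefix can be
read; it says nothing about whether the prefixes are sensible.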

   There is an inevitable grey area when the "total path attribute
length" looks dubious but not downright illegal. IMHO, we cannot
usefully standardize "dubious" here -- if we try, we'll find that
implementations differ anyway.

   So this statement sounds fine to me...

>        * whatever is done with those NLRI must reflect the fact
>          that the recipient has an incomplete, possibly empty,
>          set of attributes for those NLRI.

   I take this to mean there's something about the attributes we
don't choose to trust: thus our picture of the "usable" attributes
differs from that of our peer. (Indeed, we may choose to trust
none of them.)

>   3) "Ignorable Error" -> process the UPDATE message as if the
>      ignored attributes

   Some words seem to have gotten lost in transport...

>      Some errors in some trivial attributes may be ignorable.
>      The requirements could cover the criteria for being deemed
>      trivial.

   "trivial" may be an unfortunate word...

>      Some errors in Optional Transitive may be dealt with by
>      ignoring the attribute altogether.  The requirements
>      mention this, but do not specify criteria for being
>      ignorable.

   "Optional Transitive" is well-defined. Perhaps some advertised
capabilities are less well-defined...

>   4) "Recoverable Error" -> process the UPDATE message which has
>      had errors "patched up".

   This sounds dangerous.

   The path attributes are expected to reflect how our peer came
upon this route. It used to be the case that regardless of the
attributes, we could be sure the peer would _use_ this route; but
we've been chipping away at that assurance.

   I'm very uncomfortable about "patching up" UPDATE information
and passing it to other peers without being fully clear about
whether there's a sufficiently long path that actually _uses_
the route.

>      The draft-ietf-idr-error-handling, for example, suggests
>      that invalid Attribute Flags may simply be overwritten
>      by the expected value.

   I confess to being less than comfortable with idr-error-handling.
So I'll let others discuss how wise this is...

> I would then divide the forms of error into (1) "framing" and (2)
> "content" (or "semantic").
> 
> A BGP UPDATE message has three levels of framing:
> 
>   * Level 1 -- the 16 octet "Marker" + Message Length
>                                      + Withdrawn Routes Length
>                                      + Total Path Attributes Length
> 
>     If the Message Length is broken, it is extremely likely that the
>     "Marker" on the next message will be invalid.

   This sort of error clearly qualifies as "broken peer".
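
   For concreteness, a sketch of the Level 1 checks, assuming the
RFC 4271 layout (16-octet all-ones Marker, 2-octet Length, 1-octet
Type, then the two UPDATE length fields) and assuming the caller has
already cut exactly one message out of the TCP stream. The name
split_update is hypothetical.

```python
import struct

MARKER = b"\xff" * 16
HEADER_LEN = 19      # Marker (16) + Length (2) + Type (1)
MAX_MSG_LEN = 4096   # RFC 4271 limit, without any extended-message extension
TYPE_UPDATE = 2


def split_update(msg: bytes):
    """Return (withdrawn, attributes, nlri) or raise ValueError."""
    if len(msg) < HEADER_LEN or msg[:16] != MARKER:
        raise ValueError("bad Marker")
    length, msg_type = struct.unpack("!HB", msg[16:19])
    if not HEADER_LEN <= length <= MAX_MSG_LEN or length != len(msg):
        raise ValueError("bad Message Length")
    if msg_type != TYPE_UPDATE:
        raise ValueError("not an UPDATE message")
    body = msg[HEADER_LEN:]
    if len(body) < 4:  # room for both 2-octet length fields
        raise ValueError("UPDATE too short")
    (wlen,) = struct.unpack("!H", body[:2])
    if 4 + wlen > len(body):
        raise ValueError("Withdrawn Routes Length overruns message")
    (alen,) = struct.unpack("!H", body[2 + wlen:4 + wlen])
    if 4 + wlen + alen > len(body):
        raise ValueError("Total Path Attribute Length overruns message")
    withdrawn = body[2:2 + wlen]
    attributes = body[4 + wlen:4 + wlen + alen]
    nlri = body[4 + wlen + alen:]  # whatever remains is NLRI
    return withdrawn, attributes, nlri
```

   Any ValueError here means the receiver can no longer tell where
the pieces of the message begin and end.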

>   * Level 2(a) -- the Withdrawn Routes
> 
>     Each prefix must have a valid prefix length, and the last
>     must run exactly to the end of this part of the message.

   This would appear to qualify as "broken peer".

>   * Level 2(b) -- the Attributes
> 
>     Each attribute must be correctly framed, and at the end of the
>     attributes the last one must run to exactly the end of the
>     attribute part of the message.

   To the extent that type-length-value triplets should match total
path attribute length, this failure would indicate "broken peer".
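
   A sketch of that match, assuming the RFC 4271 attribute header
(Flags octet, Type octet, then a 1-octet Length, or 2 octets when the
Extended Length bit 0x10 is set). The name walk_attributes is
hypothetical; the input is the attribute part of the message, already
cut to Total Path Attribute Length.

```python
EXT_LEN = 0x10  # Extended Length bit in the Flags octet


def walk_attributes(attrs: bytes):
    """Return [(flags, type_code, value), ...] or raise ValueError.

    Succeeds only if the attribute lengths chain exactly to the end
    of the attribute part of the message.
    """
    out = []
    offset = 0
    while offset < len(attrs):
        if offset + 3 > len(attrs):
            raise ValueError("truncated attribute header")
        flags, type_code = attrs[offset], attrs[offset + 1]
        if flags & EXT_LEN:
            if offset + 4 > len(attrs):
                raise ValueError("truncated extended length")
            length = int.from_bytes(attrs[offset + 2:offset + 4], "big")
            offset += 4
        else:
            length = attrs[offset + 2]
            offset += 3
        if offset + length > len(attrs):
            raise ValueError(f"attribute type {type_code} overruns the field")
        out.append((flags, type_code, attrs[offset:offset + length]))
        offset += length
    return out
```

   Passing this walk shows only that the lengths are mutually
consistent, not that any attribute's content is sane.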

>   * Level 2(c) -- the Network Layer Reachability Information.
> 
>     Same as 2(a).
> 
>   * Level 3 -- various Attributes
> 
>     Some attributes have internal framing.

   Indeed!

> So far, so obvious.  To judge if an individual attribute is properly
> framed, we need to consider the red-tape:
> 
>   * the Flags octet has a limited set of valid values, depending
>     on the Type.

   True, but probably not all BGP speakers check all of these...

>   * the Type may be more or less anything, but repeats are not
>     valid.

   Again, not all BGP speakers necessarily check this...

>   * the Length is constrained for some Types

   I suspect such checks are rather spotty...
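
   To make the three red-tape checks concrete, here is a sketch
covering a handful of RFC 4271 attributes. The table KNOWN and the
function check_red_tape are hypothetical names; only the
Optional/Transitive bits are compared, and an implementation that
cares would extend the table and also look at the Partial bit.

```python
# Flag bits in the attribute Flags octet (RFC 4271, section 4.3).
OPTIONAL, TRANSITIVE = 0x80, 0x40

# type code -> (name, expected Optional/Transitive bits, fixed length or None)
KNOWN = {
    1: ("ORIGIN", TRANSITIVE, 1),
    2: ("AS_PATH", TRANSITIVE, None),       # variable length
    3: ("NEXT_HOP", TRANSITIVE, 4),
    4: ("MULTI_EXIT_DISC", OPTIONAL, 4),
    5: ("LOCAL_PREF", TRANSITIVE, 4),
    6: ("ATOMIC_AGGREGATE", TRANSITIVE, 0),
}


def check_red_tape(attributes):
    """attributes: iterable of (flags, type_code, value) tuples.

    Returns a list of complaints; an empty list means nothing was found.
    """
    problems = []
    seen = set()
    for flags, type_code, value in attributes:
        if type_code in seen:
            problems.append(f"type {type_code}: repeated")
        seen.add(type_code)
        known = KNOWN.get(type_code)
        if known is None:
            continue  # unknown type: flags and length cannot be judged
        name, want_bits, want_len = known
        if (flags & (OPTIONAL | TRANSITIVE)) != want_bits:
            problems.append(f"{name}: unexpected flags 0x{flags:02x}")
        if want_len is not None and len(value) != want_len:
            problems.append(f"{name}: length {len(value)}, expected {want_len}")
    return problems
```

   What to do with a non-empty result (drop, withdraw, ignore) is
exactly the policy question under discussion; the sketch only
reports.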

> There is some redundancy here, more for known types than unknown ones,
> which helps.  The Total Path Attributes Length is, effectively, a
> checksum for all the Lengths of all the Attributes.  It would be
> possible to specify that a set of attributes should be deemed
> correctly framed solely on the basis of passing that test.

   This feels like the wrong question. Failing that test does indicate
a broken peer, but passing it doesn't prove the attributes safe to
send to a peer.

> However, my feeling is that all the available redundancy (such as
> it is) should be used to minimise the possibility of accepting
> broken attributes -- *particularly* where an error is going to be
> treated as Ignorable.

   Yes, I am uncomfortable here.

   It's probably a good practice to check such minutiae (even for
attributes you know you don't care about), but it's foolish to hope
we can get such checking to be uniform.

   And it's not clear what actual harm would come from inconsistent
checking of such minutiae...

> Once attributes are correctly framed, then one can consider their
> content.  Wherever the line between framing and content is drawn, I
> think it helps to be clear about the distinction between them --
> "framing" errors affect the attribute and the attributes around it,
> "content" errors affect only the attribute.

   I like that distinction (though my primary concern is distinguishing
a broken peer from a careless peer).

> The framing of an Optional Transitive is a special case.  If the
> parser recognises an Optional Transitive, but its Length is not valid,
> what should the receiver do ?

   It depends on what the meaning of "not valid" is...

> If the sender did not understand the
> Attribute, then the broken Length is a "content" issue.  If the sender
> did understand it, then the broken Length is a "framing" issue.  (It
> is a serious disappointment to me that the Partial bit does not help
> here.  But even if it did, what if the sender made a mess of
> setting/clearing it !?)

   As I've said for years, "Error-correction is the most error-prone
operation in computing." :^(

> In section 2.1.2 the draft specifies a number of "Semantic BGP
> Errors", which includes many things which I would class as "framing"
> errors.

   Indeed, I prefer to think in terms of "framing".

> This is all pretty low level stuff.  I can hear an argument that the
> requirements document is not the place for this level of detail.
> However, without a more precise understanding of how broken attributes
> may be parsed, requirements for how to deal with them are hard to
> specify and to interpret.

   Especially if we try to "correct" errors!

> If NLRI were explicitly separate from the attributes, then if a set of
> attributes fails a strict "framing" check, then "treat-as-withdraw"
> (or equivalent) could be applied, reliably.  This seems to me to be as
> safe as possible, short of dropping the session (which has its own
> safety issues).
> 
> With NLRI mixed up in the attributes, either one plays safe and treats
> all attribute errors as Critical, or a much more detailed analysis of
> attribute parsing is required.  What is the cost of missing some NLRI
> which were sent, but were obscured by some other broken attribute ?
> What is the risk ?  What degree of broken-ness of an attribute can be
> deemed not to invalidate the parsing of the attributes before and/or
> after it ?  Is that different for different attributes ?

   This deserves discussion...

> In order to contemplate classifying some attribute errors as
> "Ignorable" or "Recoverable", a more detailed analysis of attribute
> parsing is also required.  An ATOMIC_AGGREGATE attribute is arguably
> trivial and Ignorable.  But is an ATOMIC_AGGREGATE attribute with a
> length of 421 (say) likely to be a momentary lapse of concentration at
> the sender end, or more likely to be a symptom of a badly broken set
> of attributes ? 

   That's something I'd guess most peers can't be bothered to check.

--
John Leslie <[email protected]>
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
