John Leslie wrote (on Fri 31-Aug-2012 at 15:54 +0100):
> Chris Hall <[email protected]> wrote:
...
> > It seems to me that there are four severities/responses:
> >
> >   1) "Critical Error" -> drop/restart session or AFI/SAFI
> >
> >      So the overall response then depends on how gracefully the
> >      session drop and/or restart can be handled.

> BGP was designed for a naive case where all your peers were
> either dependable or broken. Moving beyond that has proven
> difficult. :^(

Certainly I have found it tricky when trying to do some of the things
in draft-ietf-idr-error-handling in Quagga !

> But we should recognize the "fully broken" case, which amounts to
> "drop the session and be in no hurry to try that peer again".

One of my concerns with improving error handling is that however
inconvenient dropping an entire session may be, and however disruptive
repeatedly dropping and restarting sessions is, at least it draws
attention to a problem. If, on the other hand, the error handling
becomes too smooth, it may hide problems.

> Arguably, this should even require human intervention before
> trying another session, but I have no horse in that race...
>
> It is indeed interesting to consider that one AFI/SAFI may be
> "fully broken" while others seem dependable: perhaps we should
> consider dropping only the part of a session concerning that
> AFI/SAFI...

So, I guess by "Critical" what one really means is some error which is
so severe that all NLRI (possibly only for a subset of AFI/SAFI)
learned during the session are suspect. Suspect because some NLRI
(possibly in a subset of AFI/SAFI for the session) are affected by the
broken UPDATE, but it's not possible to be sure which ones.

> > 2) "Serious Error" -> do something with NLRI, short of
> >    dropping the session.

> This is a bit vague...

So, I think the distinction between Critical and Serious is that for a
Serious error it is possible to identify the NLRI which the broken
UPDATE refers to, but not possible to identify enough of the
Attributes to be sure of the new state of those NLRI. So, whatever the
problem is, it is limited to these NLRI, and all other NLRI may be
assumed to be unchanged.

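
To make the Critical/Serious distinction a little more concrete, here
is a rough Python sketch. It is an illustration only, not code from
Quagga or from any draft. The names Severity, walk_nlri and classify
are made up for this example, and it assumes the plain RFC 4271 IPv4
NLRI encoding (one octet of prefix length in bits, followed by just
enough octets to hold that many bits). The test used for "the NLRI can
be identified" is one possible choice: every length is valid and the
prefixes exactly fill the field.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    SERIOUS = "serious"      # damage limited to the NLRI in this UPDATE
    CRITICAL = "critical"    # cannot tell which NLRI are affected


def walk_nlri(field: bytes, max_bits: int = 32):
    """Return the (bits, prefix-octets) pairs in an NLRI field, or None.

    Assumption: IPv4 unicast NLRI as in RFC 4271, so max_bits is 32.
    None means the field could not be framed with confidence.
    """
    prefixes = []
    pos = 0
    while pos < len(field):
        bits = field[pos]
        size = (bits + 7) // 8               # octets needed for the prefix
        if bits > max_bits or pos + 1 + size > len(field):
            return None                      # bad length, or runs off the end
        prefixes.append((bits, field[pos + 1:pos + 1 + size]))
        pos += 1 + size
    return prefixes                          # lengths add up to len(field)


def classify(nlri_field: bytes, attributes_ok: bool):
    """Hypothetical classifier: (severity, NLRI known to be affected)."""
    prefixes = walk_nlri(nlri_field)
    if prefixes is None:
        return Severity.CRITICAL, None       # any NLRI from the peer is suspect
    if not attributes_ok:
        return Severity.SERIOUS, prefixes    # only these NLRI are in doubt
    return Severity.OK, prefixes
```

In this sketch an NLRI field that stops part-way through a prefix is
classified Critical however sound the attributes look, while a
well-framed NLRI field with unusable attributes is Serious and the
affected prefixes are returned to the caller.
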
What degree of proof is required to be sure (or sure enough) of being
able to identify the NLRI is the crux of the matter when it comes to
discussing how the parsing of attributes should proceed in the face of
errors.

> > The "treat-as-withdraw" mechanism is mentioned.

> This is appropriate when we have reason to believe the peer is
> dependable, but is passing us NLRI routing information we're not
> willing to trust. IMHO, "not willing to trust" is in the eye of
> the beholder...

It would clearly be possible to treat all errors as prima facie
evidence that the peer has left the reservation and is unlikely to
return to their senses any time soon, so that the only sensible
response is to forget all their routes.

So, yes, I think you are correctly observing that there are a number
of possible responses to each of the different severities of error.
The distinction is that with a Serious error the response *can* be
limited to the NLRI in the UPDATE, but for a Critical error it cannot.

...
> > But I think that the requirements should address what outcome
> > is expected if errors in an individual UPDATE message are to
> > be limited to that message. I think what that means is:
> >
> >   * it must be possible to identify all NLRI that the message
> >     could be carrying.

> I read this to mean that all length+prefix must be readable
> (but not necessarily considered valid), whether or not the path
> attributes make sense.

I think that this requirement is best met by separating NLRI from
Attributes. Then, if all the NLRI have valid lengths, and those add
up to the total length of the NLRI section(s), then you have
identified the NLRI. For me, anything short of that should be deemed
a Critical error.

...
> >   * whatever is done with those NLRI must reflect the fact
> >     that the recipient has an incomplete, possibly empty,
> >     set of attributes for those NLRI.

> I take this to mean there's something about the attributes we
> don't choose to trust: thus our picture of the "usable" attributes
> differs from that of our peer. (Indeed, we may choose to trust
> none of them.)

Yes.

> > 3) "Ignorable Error" -> process the UPDATE message as if the
> >    ignored attributes

> Some words seem to have gotten lost in transport... :-(

...as if the ignored attributes were not there.

...
> >    Some errors in Optional Transitive may be dealt with by
> >    ignoring the attribute altogether. The requirements
> >    mention this, but do not specify criteria for being
> >    ignorable.

> "Optional Transitive" is well-defined. Perhaps some advertised
> capabilities are less well-defined...

What I was trying to get at is this: clearly a BGP implementation
which does not understand a given Optional Transitive attribute is
able to use the route(s) without it. Perhaps this means that an
implementation which does understand the attribute, but finds that it
is not valid in some way, could simply pretend not to understand it ?
Or, does some greater responsibility come with greater knowledge ?

> > 4) "Recoverable Error" -> process the UPDATE message which has
> >    had errors "patched up".

> This sounds dangerous.

Amen to that.

...
> > So far, so obvious. To judge if an individual attribute is
> > properly framed, we need to consider the red-tape:
> >
> >   * the Flags octet has a limited set of valid values, depending
> >     on the Type.

> True, but probably not all BGP speakers check all of these...

> >   * the Type may be more or less anything, but repeats are not
> >     valid.

> Again, not all BGP speakers necessarily check this...

> >   * the Length is constrained for some Types

> I suspect such checks are rather spotty...

Hmmm. I can only speak for the Quagga stuff I am working on, which
works through the attributes it knows about with a fine tooth comb.

...
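
For what it is worth, the three red-tape checks listed above can be
sketched in a few lines of Python. This is not the Quagga code; the
names KNOWN and check_framing are invented for the example, and the
table covers only a handful of RFC 4271 attribute types. It assumes
the RFC 4271 attribute layout: a Flags octet, a Type octet, then a
one-octet Length, or two octets when the Extended Length bit (0x10) is
set. The low four flag bits are ignored.

```python
# Type code -> (expected Flags & 0xE0, fixed Length or None), per RFC 4271.
# Deliberately incomplete: an illustrative subset only.
KNOWN = {
    1: (0x40, 1),     # ORIGIN
    2: (0x40, None),  # AS_PATH, variable length
    3: (0x40, 4),     # NEXT_HOP
    4: (0x80, 4),     # MULTI_EXIT_DISC
    5: (0x40, 4),     # LOCAL_PREF
    6: (0x40, 0),     # ATOMIC_AGGREGATE
}


def check_framing(attrs: bytes):
    """Walk a Path Attributes field and return a list of complaints.

    Only the framing is checked; attribute values are never inspected.
    An empty list means no framing problem was found.
    """
    problems = []
    seen = set()
    pos = 0
    while pos < len(attrs):
        if pos + 2 > len(attrs):
            problems.append("truncated attribute header")
            break
        flags, type_code = attrs[pos], attrs[pos + 1]
        len_size = 2 if flags & 0x10 else 1          # Extended Length bit
        if pos + 2 + len_size > len(attrs):
            problems.append(f"type {type_code}: truncated length")
            break
        length = int.from_bytes(attrs[pos + 2:pos + 2 + len_size], "big")
        end = pos + 2 + len_size + length
        if end > len(attrs):
            problems.append(f"type {type_code}: length {length} overruns")
            break                                    # cannot find next attribute
        if type_code in seen:
            problems.append(f"type {type_code}: repeated")
        seen.add(type_code)
        if type_code in KNOWN:
            want_flags, want_len = KNOWN[type_code]
            if (flags & 0xE0) != want_flags:
                problems.append(f"type {type_code}: bad flags {flags:#04x}")
            if want_len is not None and length != want_len:
                problems.append(f"type {type_code}: bad length {length}")
        pos = end
    return problems
```

Note that the sketch stops walking as soon as a Length would run past
the end of the field, because after that point there is no way to know
where the next attribute starts. Bad flags, repeats and wrong fixed
lengths are reported but do not stop the walk, since the framing can
still be followed. How each complaint maps onto a severity is left
open here.
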
> It's probably a good practice to check such minutiae (even for
> attributes you know you don't care about), but it's foolish to hope
> we can get such checking to be uniform.
>
> And it's not clear what actual harm would come from inconsistent
> checking of such minutiae...

As I see it the problem is this: in order to be able to treat errors
in an UPDATE as Serious, but not Critical, it is essential that the
receiver be able to at least identify the NLRI with some (high?) level
of confidence. While NLRI are mixed up with the Attributes (a bona
fide, first class, copper bottomed KLUDGE, IMNSHO) achieving that
requires some careful analysis of how Attributes are parsed when they
are broken. Grinding out the detail is sort of an argument for
separating NLRI from Attributes. The same detail is relevant to
efforts to loosen checks on Attributes, as suggested in
draft-ietf-idr-error-handling.

The actual harm which could come from inadequate parsing of Attributes
is that some NLRI in a broken UPDATE message could be missed -- that
is, some errors in an UPDATE could be treated as Serious, where they
should be treated as Critical.

It's possible that the risk of such an outcome is acceptable... but I
don't think the requirements cover it.

...
> > In order to contemplate classifying some attribute errors as
> > "Ignorable" or "Recoverable", a more detailed analysis of
> > attribute parsing is also required. An ATOMIC_AGGREGATE
> > attribute is arguably trivial and Ignorable. But is an
> > ATOMIC_AGGREGATE attribute with length of 421 (say) likely
> > to be a momentary lapse of concentration at the sender end,
> > or more likely to be a symptom of a badly broken set
> > of attributes ?

> That's something I'd guess most peers can't be bothered to check.

Goodness. I do hope that's not the case :-(

Chris
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow
