Hi All,

I'd like to ask the WG for its collective opinion on a couple of matters in this draft. These come from some discussions at IETF83 (in particular with John Scudder and Adam Simpson) about how the requirements regarding repeated errors are currently written.

The wording right now in the draft implies that at some point, the speaker that detects the failure could push the session into a "hold up" state, ignoring any further UPDATE errors (i.e. at some point the 'critical' error in the UPDATE is just ignored), and the prefixes that are there are maintained.

   In the case of implementation errors, it is possible that the BGP session in question may enter a continuous loop of being reset, with a partial RIB being held by one or more of the BGP speakers due to a non-deterministic order of UPDATE propagation. It is therefore a requirement that within this reduced-impact procedure any subsequent UPDATE messages that would result in further session resets are ignored. Whilst this results in a condition where an undetermined amount of the RIB is inconsistent, partial reachability is maintained. In this case, the operational toolsets discussed in Section 6 are likely to provide mechanisms by which this condition can be brought to the attention of the relevant operators. This requirement to accept a partial RIB, which results in potentially invalid traffic forwarding, is a direct result of the deployments of BGP-4, as described in Section 1.1.

Actually, I think that the discussions I've had with various WG participants imply that at some point, instead of ignoring the UPDATE, the session should be considered no longer viable, and the session ceased. I think that the latter view is relevant only to Critical errors (i.e. those whereby the NLRI cannot be identified) that cause session reset, and that the implication should be that at some point there is a requirement /not/ to continually keep applying the mechanisms that are specified in Section 5 of the draft. This is the behaviour that is implied in Section 7.1 of the draft.

The original wording of the draft did not draw a distinction between the behaviour expected for Semantic and Critical errors, which I think meant that the above recommendation was more relevant (one would not want to tear down a session based on errors that could be handled with treat-as-withdraw). Given that repeated Critical errors require repeated session-level error handling, this would seem to be indicative of a relatively long-lived failure (which is unlikely to recover), and hence at some point continually performing hitless restarts becomes a futile exercise.

So, given that there is an internal lack of consistency between these two sections, my suggestion is to restructure Section 7 of the draft such that it recommends:

- For Semantic errors, a staged recovery approach should be taken (should any automatic recovery be attempted) such that a specific recovery mechanism is implemented, in preference to re-requesting the entire RIB (so as to reduce the number of messages that are received).
- Where such Semantic errors are repeated, at some point (defined in the memo for each recovery mechanism, or on a per-implementation basis based on the resource consumption of the recovery mechanism), no further automatic refresh action should be taken as a result of further Semantic errors. Such a condition should be flagged to an operator.
- For repeated Critical errors, an implementation maintains some "session badness" flag, which can be used to work out that the "hitless session restart" mechanisms (§5) should no longer be applied. This condition should be flagged to an operator.
- There is no requirement within the "hitless session restart" mechanism to have any form of delay in terms of re-opening the session, since this must occur inside the time bound for the session restart described in Section 5.
- (As per John's suggestion) It is recommended that there is an exponential back-off for session re-establishment following the decision to stop using hitless restart procedures for a session (as per the existing definition for peer oscillation in RFC4271).

Is the WG in agreement with this recommendation, and with the restructuring of this section?

Many thanks,
r.
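
For illustration only, the staged behaviour proposed in the list above could be sketched roughly as follows. This is not text from the draft: the class and action names, the thresholds and the timer values are all hypothetical, and the proposal deliberately leaves the actual limits to the recovery-mechanism memo or to the implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ErrorClass(Enum):
    SEMANTIC = auto()  # NLRI identifiable; treat-as-withdraw style handling applies
    CRITICAL = auto()  # NLRI cannot be identified; session-level handling needed


class Action(Enum):
    TARGETED_RECOVERY = auto()        # specific mechanism, not a full RIB re-request
    NO_AUTOMATIC_RECOVERY = auto()    # stop automatic refresh, flag to operator
    HITLESS_RESTART = auto()          # Section 5 mechanism, no added delay
    FULL_RESET_WITH_BACKOFF = auto()  # hitless restart abandoned for this session


@dataclass
class SessionErrorPolicy:
    # Every default below is an assumption made for this sketch.
    max_semantic_recoveries: int = 3
    max_hitless_restarts: int = 3
    base_backoff_s: float = 30.0
    max_backoff_s: float = 3600.0
    semantic_errors: int = 0
    critical_errors: int = 0  # stands in for the "session badness" indicator
    full_resets: int = 0

    def on_error(self, kind):
        """Return (action, delay_seconds, flag_to_operator) for one error."""
        if kind is ErrorClass.SEMANTIC:
            self.semantic_errors += 1
            if self.semantic_errors <= self.max_semantic_recoveries:
                return Action.TARGETED_RECOVERY, 0.0, False
            return Action.NO_AUTOMATIC_RECOVERY, 0.0, True

        self.critical_errors += 1
        if self.critical_errors <= self.max_hitless_restarts:
            # No delay here: re-opening must fit the Section 5 time bound.
            return Action.HITLESS_RESTART, 0.0, False

        # Session judged too "bad" for hitless restart: back off exponentially,
        # doubling the delay on each further reset up to a ceiling.
        delay = min(self.base_backoff_s * 2 ** self.full_resets, self.max_backoff_s)
        self.full_resets += 1
        return Action.FULL_RESET_WITH_BACKOFF, delay, True
```

The counters in this sketch never decay. A real implementation would presumably age them out after some error-free period, which is a detail the list above does not address.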
