[GROW] Repeated Errors in BGP - draft-ietf-grow-ops-reqs-for-bgp-error-handling

Rob Shakir Fri, 13 Apr 2012 11:27:50 -0700

Hi All,

I'd like to ask the WG their collective opinion on a couple of matters in this 
draft, which come from some discussions at IETF83 (in particular with John 
Scudder and Adam Simpson) about how the requirements are currently written 
regarding repeated errors.


The wording right now in the draft implies that at some point, the speaker that 
detects the failure could push the session into a "hold up" state, ignoring any 
further UPDATE errors (i.e. at some point the 'critical' error in the UPDATE is 
just ignored), and the prefixes that are already there are maintained.

   In the case of implementation errors, it is
   possible that the BGP session in question may enter a continuous loop
   of being reset, with a partial RIB being held by one or more of the
   BGP speakers due to an non-deterministic order of UPDATE propagation.
   It is therefore a requirement that within this reduced-impact
   procedure any subsequent UPDATE messages that would result in further
   session resets are ignored.  Whilst this results in a condition where
   an undetermined amount of the RIB is inconsistent, partial
   reachability is maintained.  In this case, the operational toolsets
   discussed in Section 6 is likely to provide mechanisms by which this
   condition can be brought to the attention of the relevant operators.
   This requirement to accept a partial RIB, which results in potential
   invalid traffic forwarding is a direct result of the deployments of
   BGP-4, as described in Section 1.1.

I think that, actually, the discussions that I've had with various WG 
participants imply that at some point, instead of ignoring the UPDATE, the 
session should be considered no longer viable, and the session should be 
ceased.

I think that the latter view is relevant only to Critical errors (i.e. those 
whereby the NLRI cannot be identified) that cause session reset - and that the 
implication should be that at some point there is a requirement /not/ to 
continually keep applying the mechanisms that are specified in Section 5 of the 
draft. This is the behaviour that is implied in Section 7.1 of the draft.

The original wording of the draft did not draw a distinction between the 
behaviour expected for Semantic and Critical errors - which I think meant that 
the above recommendation was more relevant (one would not want to tear down 
based on errors that could be handled with treat-as-withdraw). Given that 
repeated Critical errors require repeated session-level error handling, this 
would seem to be indicative of a relatively long-lived failure (which is 
unlikely to recover) and hence at some point, continually performing hitless 
restarts becomes a futile exercise.

Given that there is an internal lack of consistency between these two 
sections, I propose restructuring Section 7 of the draft, such that it 
recommends:

- For Semantic errors, a staged recovery approach should be taken (should any 
automatic recovery be attempted) such that a specific recovery mechanism is 
implemented, in preference to re-requesting the entire RIB (so as to reduce the 
number of messages that are received).
- Where such Semantic errors are repeated, at some point (defined in the memo 
for each recovery mechanism, or on a per-implementation basis based on the 
resource consumption of the recovery mechanism), no further automatic refresh 
action should be taken as a result of further Semantic errors. Such a condition 
should be flagged to an operator.
- For repeated Critical errors, an implementation maintains some "session 
badness" flag, which can be used to work out that the "hitless session restart" 
mechanisms (§5) should no longer be applied. This condition should be flagged 
to an operator.
- There is no requirement within the "hitless session restart" mechanism to 
have any form of delay in terms of re-opening the session, since this must 
occur inside the time bound for the session restart described in Section 5.
- (As per John's suggestion) It is recommended that there is an exponential 
back-off for session re-establishment following the decision to stop using 
hitless restart procedures for a session (as per the existing definition for 
peer oscillation in RFC4271).
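
As a rough illustration of the recommendations above, the per-session logic 
might look something like the Python sketch below. Everything named here is an 
assumption made for illustration and is not taken from the draft: the class and 
method names, the two thresholds (the draft would leave these to the recovery 
mechanism memo or to the implementation), and the back-off base and cap values.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class ErrorClass(Enum):
    SEMANTIC = "semantic"  # NLRI can be identified (e.g. treat-as-withdraw applies)
    CRITICAL = "critical"  # NLRI cannot be identified; session-level handling


class Action(Enum):
    TARGETED_RECOVERY = "targeted-recovery"          # specific mechanism, not a full RIB re-request
    NO_AUTOMATIC_RECOVERY = "no-automatic-recovery"  # stop refreshing, operator flagged
    HITLESS_RESTART = "hitless-restart"              # Section 5 mechanism, no re-open delay
    FULL_RESET_WITH_BACKOFF = "full-reset-with-backoff"


@dataclass
class SessionErrorPolicy:
    # Hypothetical thresholds and timers; none of these values come from the draft.
    max_semantic_recoveries: int = 3
    max_hitless_restarts: int = 3
    backoff_base_s: float = 1.0
    backoff_cap_s: float = 300.0
    # Per-session state ("session badness").
    semantic_recoveries: int = 0
    hitless_restarts: int = 0
    full_resets: int = 0
    operator_alerts: List[str] = field(default_factory=list)

    def _flag(self, message: str) -> None:
        # Flag each condition to the operator once, not on every repeat.
        if message not in self.operator_alerts:
            self.operator_alerts.append(message)

    def on_error(self, kind: ErrorClass) -> Tuple[Action, float]:
        """Return the action to take and the delay (seconds) before re-opening."""
        if kind is ErrorClass.SEMANTIC:
            if self.semantic_recoveries < self.max_semantic_recoveries:
                self.semantic_recoveries += 1
                return Action.TARGETED_RECOVERY, 0.0
            self._flag("repeated Semantic errors: automatic recovery stopped")
            return Action.NO_AUTOMATIC_RECOVERY, 0.0

        # Critical error: hitless restart with no delay until the session
        # is judged too "bad" for it to be worthwhile.
        if self.hitless_restarts < self.max_hitless_restarts:
            self.hitless_restarts += 1
            return Action.HITLESS_RESTART, 0.0

        self._flag("repeated Critical errors: hitless restart no longer applied")
        # Exponential back-off for ordinary session re-establishment.
        delay = min(self.backoff_cap_s, self.backoff_base_s * 2 ** self.full_resets)
        self.full_resets += 1
        return Action.FULL_RESET_WITH_BACKOFF, delay
```

In this sketch the hitless restart path always returns a zero delay, matching 
the point that no delay is required inside the Section 5 time bound, and the 
back-off only starts once hitless restart has been abandoned for the session.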

Is the WG in agreement with this recommendation, and restructuring of this 
section?

Many thanks,
r.


