Re: [GROW] draft-ietf-grow-ops-reqs-for-bgp-error-handling-04

UTTARO, JAMES Sat, 23 Jun 2012 19:06:59 -0700

Rob,

        Comments In-Line...

Thanks,
        Jim Uttaro

-----Original Message-----
From: Rob Shakir [mailto:[email protected]] 
Sent: Saturday, June 23, 2012 6:39 PM
To: UTTARO, JAMES
Cc: '[email protected]'; 'idr wg'
Subject: Re: draft-ietf-grow-ops-reqs-for-bgp-error-handling-04

Hi Jim,

Thanks very much for the detailed review of this document.

On 22 Jun 2012, at 15:56, UTTARO, JAMES wrote:

> Rob,
>  
> Following find my comments..
>  
> Thanks,
>                 Jim Uttaro
>  
> General Comment,
>  
> From a philosophical perspective I agree with the goals of this draft but I 
> do not agree with an approach that maintains a session in the face of a 
> failure in the machinery. This is a bottom up approach which will always be a 
> day late and a dollar short as we continue to patch what it means to be a 
> viable session.. I believe the proper approach to meet your reqs ( and mine 
> too!!! ) is a top down which does not change the BGP behavior but changes the 
> response to that behavior.. The reality of today's BGP fields of use which 
> are many and varied is that the control and the forwarding paths are 
> orthogonal, the state being carried is not always paths as we usually 
> consider them. Examples include RT-C, Flowspec which are used to create PWs 
> and simulate an IGP etc... I do not believe changing the behavior of BGP 
> session viability machinery to accomplish maintaining valid forwarding in the 
> face of a control plane failure is the correct approach
>  
> The draft addresses a very important error condition ( Update Error ).. 
> This should be expanded to other areas where the failure of a session does 
> not indicate that a subset or all of the routing state learned over said 
> session is invalid.. I think the draft should address this directly.. Are the 
> changes here configurable? Can I turn this off if I do not want this behavior 
> for certain topologies, AFs?

The scope for this draft was specifically to handle errors that occurred in the 
DFZ (and in some private networks) where UPDATE messages with malformed 
contents representing a subset of NLRI carried via a session (1 prefix in 
numerous cases) resulted in complete failure of these sessions. In particular, 
errors where a malformed optional transitive attribute was "tunnelled" across 
multiple speakers that did not parse it were especially destructive in the DFZ. 
Jonathan Oddy, Andy Davidson and I spent some effort analysing one of these 
incidents back in 2008/2009 -- 
http://mailman.nanog.org/pipermail/nanog/2009-January/006816.html. 
[Jim U>] Got it..

One of the complexities in this problem space is that there are multiple 
underlying philosophies as to how such error behaviour should be handled:

1. Conservative -- any error in messages received from a neighbour is 
indicative that it is not a viable route to any prefix it advertises. Therefore 
where error conditions occur, disconnect and remove all routes from the 
neighbour from the RIB.

[Jim U>] This is based on the current behavior of BGP.. If the NH is still 
viable you could allow the session to fall and persist the good state that has 
already been learned..

2. Balanced risk -- some errors are not indicative of an error on the remote 
speaker, or are localised to specific NLRI/prefixes that are propagated via 
this speaker. Therefore take a balanced approach of applying error handling 
mechanisms to those NLRI, but continue to trust the integrity of the speaker 
for the remainder of routing information.

Essentially, this draft's scope was to explain the requirements that exist to 
manage the risk of taking the latter view. At the moment, there is a risk to an 
operator presented by the first (a malformed UPDATE may break the BGP sessions 
that run between their PEs and RRs, or those that propagate their routes to the 
Internet DFZ, and cause a service outage), and there is no option to balance 
the impact of that risk against the impact of BGP being incorrect for a subset 
of prefixes propagated via those sessions.
[Jim U>] Yes I have seen this..
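
The two philosophies above can be contrasted in a minimal sketch. All names 
here (`Policy`, `Update`, `Rib`, `handle_update`, the `malformed` flag) are 
hypothetical and are not taken from the draft or from any implementation. The 
sketch also assumes that the NLRI of a malformed UPDATE can still be extracted, 
which may not hold for every error.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Policy(Enum):
    CONSERVATIVE = "conservative"  # philosophy 1: reset the session
    BALANCED = "balanced"          # philosophy 2: act only on affected NLRI


@dataclass
class Update:
    nlri: list[str]          # prefixes carried in this UPDATE
    malformed: bool = False  # hypothetical flag set by the message parser


@dataclass
class Rib:
    """Routes learned from one neighbour, plus that session's state."""
    routes: set[str] = field(default_factory=set)
    session_up: bool = True


def handle_update(rib: Rib, update: Update, policy: Policy) -> None:
    """Apply one UPDATE from a single neighbour under the chosen policy."""
    if not update.malformed:
        rib.routes.update(update.nlri)
        return
    if policy is Policy.CONSERVATIVE:
        # Tear down the session and drop everything learned over it.
        rib.session_up = False
        rib.routes.clear()
    else:
        # Remove only the NLRI named in the erroneous UPDATE; keep the rest.
        rib.routes.difference_update(update.nlri)
```

Under `Policy.BALANCED` a single bad prefix costs one route; under 
`Policy.CONSERVATIVE` it costs the whole session, which is the trade-off the 
operator is being asked to weigh.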

I would expect all solutions implemented in response to these requirements to 
be optional. If the risk of incorrectness is unacceptable to you/an operator, 
then you should absolutely not enable any of these mechanisms. In a number of 
networks that I have operated, designed and architected, I am prepared to 
accept the risk of incorrectness, as I consider it acceptable when compared to 
the risk of complete service outages in terms of impact to my customers during 
such incidents. At the moment, without the work described through the 
requirements outlined in this draft I do not have the means to make that call...
[Jim U>] I do not understand how it is possible to make this configurable on a 
per session or AS basis.. I would think all speakers participating in a routing 
context would have to adhere to the same rules for a consistent view across 
domains.. In my reading of the IDR draft it seems that it would be a MUST.. 
Maybe I should not be considering that IDR draft as the actual realization of 
the reqs..
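
For illustration only, an opt-in knob of the kind discussed here might look 
like the sketch below. The address-family labels, the policy names, and 
`policy_for` are assumptions made for this example; they do not describe any 
vendor's configuration or anything specified in the drafts. It does not 
address the consistency concern raised above, since each speaker would still 
be configured independently.

```python
# Hypothetical per-address-family setting. Anything not listed keeps
# today's behaviour (reset the session on an erroneous UPDATE).
DEFAULT_POLICY = "session-reset"

error_handling = {
    "ipv4-unicast": "per-nlri",  # operator has opted in for Internet routes
}


def policy_for(address_family: str) -> str:
    """Return the configured error-handling policy for one address family."""
    return error_handling.get(address_family, DEFAULT_POLICY)
```

With this table, `policy_for("rt-constraint")` falls back to the session-reset 
default because the operator has not opted in for that address family.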

>  Abstract
>  
> Can the scope be expanded? There are other failure modes, e.g. Timer Expiry 
> which today is not considered a failure mode. In reality what I have seen is 
> that timer expiry occurs due to the fact that BGP threads cannot be serviced 
> in a timely manner.  I think it would be best if we could put it all on the 
> table..
>  
> I do think that this draft should bound the solution space. At the minimum 
> the solutions proposed should meet a minimum set of the operators' criteria in 
> terms of managing the network, convergence, persistence, churn, forwarding 
> impact etc...  

I think there was clear consensus amongst operators with whom I have spoken to 
work on the problem space of erroneous UPDATEs - particularly in response to 
observed incidents. The very real risk of expanding the scope of this document 
even further to handle any error in the BGP protocol is that we spend another 
few years cataloguing such conditions, whilst doing nothing about the ones that 
we have identified.
[Jim U>] No doubt.. But I have seen other errors that relate to overloading 
BGP, topological isolation etc... that have created huge outages in my 
network.. So, my thinking is that we should try to consider all of the 
challenges the protocol faces in terms of error conditions. 

>  Section 1.1
>  
> The following paragraph is based on the premise that the session being "down" 
> results in a large impact.. This is certainly true for today's 
> implementations which use the session ( Control Plane Construct ) to 
> determine the viability of the forwarding state learned over said session. 
> There are cases where the session and forwarding are parallel, but in 
> many more cases control and forwarding planes are orthogonal.. I think we 
> need to re-consider this assumption for many of the services BGP is being 
> used for and base the response to error conditions on this reality..
>  
> " Both within Internet and multi-service routing architectures, a
>    number of BGP sessions propagate a large proportion of the required
>    routing information for network operation.  For Internet routing,
>    these are typically BGP sessions which propagate the global routing
>    table to an AS - failure of these sessions may have a large impact on
>    network service, based on a single erroneous update.  In an multi-
>    service environment, typical deployments utilise a small number of
>    core-facing BGP sessions, typically towards route reflector devices.
>    Failure of these sessions may also result in a large impact to
>    network operation.  Clearly, the avoidance of conditions requiring
>    these sessions to fail is of great utility to any network operator,
>    and provides further motivation for the revision of the existing
>    behaviour. "

I do not understand what the assertion that you are making here is. Please 
could you explain it to me? The errors that are being discussed relate to where 
a subset of NLRI are advertised within an erroneous UPDATE message, and the 
resulting impact of the current protocol behaviour on all other NLRI carried on 
that session. The cases of IP "transit" sessions, and RR-PE sessions are only 
examples of cases where there is a large amount of routing information carried 
over a single session - and hence it is of utility to avoid these sessions 
failing where they do not necessarily need to, based on the impact to overall 
network service.
[Jim U>] The solution space here seems explicitly targeted to the internet IPv4 
AF. Not sure if there is a dependency here on the control/forwarding planes 
being in parallel. I believe we need to consider all AFs that BGP is used for, 
as the code that would be developed would be applicable to these AFs also ( I 
presume ). An example would be RT-C: is this solution applicable? I don't 
know, I am simply asking.. If we want to develop this and have it specific to 
the internet use case let's state that clearly. If not, then let's consider the 
other applications BGP supports..

This point (to me anyway) seems entirely related to the control-plane -- it 
points out that an operator has cases where one really wants to keep the impact 
of errors down to the particular subset of routing information that is 
affected. This point is entirely in the protocol, rather than implying any 
behaviour about forwarding (i.e., no implication is made that the NLRI 
identified as carried in the erroneous UPDATE are installed in the FIB, but 
rather that all *other* NLRI continue to be installed in the RIB).
[Jim U>] See above.. My point is how can we ensure reliability across 
AFs... There is no doubt that malformed updates are an issue, I just do not 
know how they affect other AFs; in those cases it may be more appropriate to 
tear down and persist instead. Can we address these other use cases?

I'll work through the points that you have made in this mail -- but I'd like to 
respond with a specific point here comparing more targeted handling of errors 
in UPDATE messages with persistence-type approaches. 

- Persistence/hold-up is not acceptable to all operators in all deployments. 
In particular, where no liveness detection mechanism for the next-hop is 
available (i.e., where it is not possible or practical to determine the NH's 
reachability in the forwarding plane other than with the routing protocol 
itself) then one would not want to accept the risk of running persistence.
[Jim U>] Agreed.. Persistence requires that the NH be viable. 

- Therefore, an operator has two options for these deployments -- stick with 
the error handling behaviour that is available in BGP right now, which gives 
him no flexibility in terms of focusing responses to errors in a manner that is 
proportional to the actual error; or alternatively, accept some additional 
complexity and risk such that particular error handling is targeted to the NLRI 
that are contained in an UPDATE message that is found to be erroneous.
[Jim U>] My concern here is that the entire support structure is based on 
session viability; the expectation here is that operators need to develop new 
approaches to understanding whether there is a routing anomaly in their 
topology, and develop tools/procedures which are different.

Given there are deployments where I cannot deploy persistence (e.g., towards an 
Internet network peer on an IXP where deployment of BFD/OAM mechanisms is very 
limited, so BGP really tells me whether the other peer is "there" at all), yet 
in these cases I do not want to affect all NLRI when one erroneous UPDATE is 
received, it would seem that I need answers to the requirements in this 
draft. 
[Jim U>] Agreed.. 
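
The condition both sides agree on (persistence is only safe when next-hop 
viability can be established independently of BGP) can be written as a small 
decision function. This is a sketch under stated assumptions: `Action`, 
`on_session_failure` and both parameters are hypothetical names, and how 
liveness would actually be measured (BFD, OAM, or something else) is left out.

```python
from enum import Enum


class Action(Enum):
    PERSIST = "keep routes learned over the failed session"
    FLUSH = "remove routes learned over the failed session"


def on_session_failure(nh_liveness_available: bool, nh_reachable: bool) -> Action:
    """Decide what to do with a neighbour's routes once its session has failed.

    nh_liveness_available: a forwarding-plane check for the next-hop exists
        that does not depend on the routing protocol itself.
    nh_reachable: the result of that check; ignored when no check exists.
    """
    if nh_liveness_available and nh_reachable:
        return Action.PERSIST
    # Without an independent liveness check, BGP is the only evidence that
    # the peer is "there" at all, so persisting its routes is not safe.
    return Action.FLUSH
```

For the IXP peer described above, `on_session_failure(False, True)` returns 
`Action.FLUSH`, which is why targeted handling of the erroneous UPDATE is the 
only way to avoid losing every route from that peer.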

Your e-mail seems to imply that you do not have cases where you will not be 
able to deploy persistence and that you are accepting of session-level error 
handling in these cases. Would this be a fair assessment?
[Jim U>] My work primarily revolves around the following BGP use cases VPN ( 
L2/L3 ), 3107, Multi-Cast, RT-C, Flowspec not Internet.. In these environments 
I would prefer that the session fail, persistence is activated, and well known 
procedures are used to correct the problem. I am in no way saying that the use 
case described here for the internet application is not viable or the correct 
approach.. But the fact remains that the code base will be used across all AFs 
and this solution requires different Ops/Troubleshooting knowledge and support. 
I would prefer consistency in the behavior.. 

Kind regards,
r.
