Re: [GROW] draft-ietf-grow-ops-reqs-for-bgp-error-handling-04

Rob Shakir Sun, 24 Jun 2012 02:10:19 -0700

Hi Jim,

Further comments in-line as [rjs].

On 24 Jun 2012, at 03:06, UTTARO, JAMES wrote:

> 1. Conservative -- any error in messages received from a neighbour is 
> indicative that it is not a viable route to any prefix it advertises. 
> Therefore where error conditions occur, disconnect and remove all routes from 
> the neighbour from the RIB.
> 
> [Jim U>] This is based on the current behavior of BGP.. If the NH is still 
> viable you could allow the session to fall and persist the good state that 
> has already been learned..

[rjs]: Absolutely, this is the current behaviour. The problem with taking a 
whole session down in this case is that you now take a risk of inconsistency 
for all NLRI across that session for the duration that you hold onto the 
learned NLRI. If one avoids being in the situation where the session is down 
(e.g., by applying treat-as-withdraw behaviour in cases where one can determine 
the NLRI) then all other NLRI on the session continue to be updated as they 
need to be. It is only the NLRI that were included in the erroneous UPDATE that 
may be affected by looping/black-holing.
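
To make the difference between the two behaviours concrete, here is a minimal 
sketch in Python. All names in it (Session, handle_update, the nlri_parseable 
and treat_as_withdraw flags) are hypothetical and come from neither the draft 
nor any implementation; it only assumes that a speaker can tell whether the 
NLRI of a malformed UPDATE could still be determined.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical per-neighbour state: rib_in maps prefix -> attributes."""
    rib_in: dict = field(default_factory=dict)
    up: bool = True


def handle_update(session, nlri, attrs, malformed=False,
                  nlri_parseable=True, treat_as_withdraw=True):
    """Apply one UPDATE to the session and return the action taken."""
    if not malformed:
        for prefix in nlri:
            session.rib_in[prefix] = attrs
        return "accepted"
    if treat_as_withdraw and nlri_parseable:
        # Only the NLRI carried in the erroneous UPDATE are removed;
        # the session stays up and other NLRI keep being updated.
        for prefix in nlri:
            session.rib_in.pop(prefix, None)
        return "treat-as-withdraw"
    # Session-level handling (no persistence): tear down and flush
    # everything learned from this neighbour.
    session.rib_in.clear()
    session.up = False
    return "session-reset"


s = Session()
handle_update(s, ["192.0.2.0/24", "198.51.100.0/24"], {"next_hop": "10.0.0.1"})
print(handle_update(s, ["192.0.2.0/24"], {}, malformed=True))
print(sorted(s.rib_in), s.up)
```

With treat_as_withdraw enabled, only 192.0.2.0/24 is dropped and the session 
remains up; with it disabled, or where the NLRI cannot be determined, the same 
error removes every route learned on the session.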

> I would expect all solutions implemented in response to these requirements to 
> be optional. If the risk of incorrectness is unacceptable to you/an operator, 
> then you should absolutely not enable any of these mechanisms. In a number of 
> networks that I have operated, designed and architected, I am prepared to 
> accept the risk of incorrectness, as I consider it acceptable when compared 
> to the risk of complete service outages in terms of impact to my customers 
> during such incidents. At the moment, without the work described through the 
> requirements outlined in this draft I do not have the means to make that 
> call...
> [Jim U>] I do not understand how it is possible to make this configurable on 
> a per session or AS basis..I would think all speakers participating in a 
> routing context would have to adhere to the same rules for a consistent view 
> across domains.. In my reading of the IDR draft it seems that it would be a 
> MUST.. Maybe I should not be considering that IDR draft as the actual 
> realization of the reqs..

[rjs]: The IDR draft is the solution for some of the requirements -- 
particularly those described in Section 3 of the GROW draft.

[rjs]: I do not see why this behaviour needs to be consistent across domains?

[rjs]: Essentially, if I receive an invalid UPDATE message, and apply 
treat-as-withdraw, if the advertising speaker did not know that this was 
erroneous then I end up with a different view of what is in the RIB than the 
advertising speaker does. If this was a prefix I had no other route to, then I 
may black-hole, if it was one where it was a more-specific of some larger 
prefix, then we end up with the potential for loops.
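
The two outcomes can be illustrated with a small longest-prefix-match sketch 
in Python. The prefixes, the next-hop labels "A" and "B", and the lookup 
helper are all assumptions made for illustration only; "expected" stands for 
the view the advertising speaker A assumes I hold, and "local" for my RIB 
after applying treat-as-withdraw to the prefixes learned from A.

```python
import ipaddress


def lookup(rib, dest):
    """Longest-prefix match; rib maps ip_network -> next-hop label."""
    addr = ipaddress.ip_address(dest)
    matches = [net for net in rib if addr in net]
    if not matches:
        return None  # no route at all: traffic is dropped (black-hole)
    return rib[max(matches, key=lambda net: net.prefixlen)]


net = ipaddress.ip_network
# The view the advertising speaker (A) assumes we hold.
expected = {net("192.0.2.0/25"): "A", net("192.0.2.0/24"): "B",
            net("198.51.100.0/24"): "A"}
# Our RIB after treat-as-withdraw of the two prefixes learned from A.
local = {net("192.0.2.0/24"): "B"}

print(lookup(expected, "192.0.2.1"), lookup(local, "192.0.2.1"))
print(lookup(expected, "198.51.100.1"), lookup(local, "198.51.100.1"))
```

For 198.51.100.1 there is no remaining route, so the lookup returns None and 
traffic is black-holed. For 192.0.2.1 the lookup falls through to the covering 
/24 via B; if B in turn forwards that traffic back towards me, we have a loop.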

[rjs]: If I am prepared to accept the black-holing or loops for the NLRI in the 
erroneous UPDATE as a risk, in favour of keeping the remaining NLRI working 
(and being updated/withdrawn if they change), then this is a local decision and 
I do not need to imply any behaviour of the neighbouring domains.

> 
>> Abstract
>> 
>> Can the scope be expanded? There are other failure modes, i.e Timer Expiry 
>> which today is not considered a failure mode. In reality what I have seen is 
>> that timer expiry occurs due to the fact that BGP threads cannot be serviced 
>> in a timely manner.  I think it would be best if we could put it all on the 
>> table..
>> 
>> I do think that this draft should bound the solution space. At the minimum 
>> the solutions proposed should meet a minimum set of the operators' criteria 
>> in terms of managing the network, convergence, persistence, churn, 
>> forwarding impact etc...  
> 
> I think there was clear consensus amongst operators with whom I have spoken 
> to work on the problem space of erroneous UPDATEs - particularly in response 
> to observed incidents. The very real risk of expanding the scope of this 
> document even further to handle any error in the BGP protocol is that we 
> spend another few years cataloguing such conditions, whilst doing nothing 
> about the ones that we have identified.
> [Jim U>] No doubt.. But  I have seen other errors that relate to overloading 
> BGP, topological isolation etc... that have created huge outages in my 
> network.. So, My thinking is that we should try to consider all of the 
> challenges the protocol faces in terms of erroneous error conditions. 

[rjs] Right, perhaps a better title for the draft would be "Operational 
Requirements for Enhanced Error Handling for UPDATE Messages in BGP-4" -- given 
that the incidents that affected networks centred particularly on errors in 
UPDATEs, it seemed logical that this was the place to start for this work. 

> 
>> Section 1.1
>> 
>> The following paragraph is based on the premise that the session being 
>> "down" results in a large impact.. This is certainly true for today's 
>> implementations which use the session ( Control Plane Construct ) to 
>> determine the viability of the forwarding state learned over said session. 
>> There are cases where the session and forwarding are parallel, but 
>> in many more cases control and forwarding planes are orthogonal.. I think we 
>> need to re-consider this assumption for many of the services BGP is being 
>> used for and base the response to error conditions on this reality..
>> 
>> " Both within Internet and multi-service routing architectures, a
>>   number of BGP sessions propagate a large proportion of the required
>>   routing information for network operation.  For Internet routing,
>>   these are typically BGP sessions which propagate the global routing
>>   table to an AS - failure of these sessions may have a large impact on
>>   network service, based on a single erroneous update.  In a
>>   multi-service environment, typical deployments utilise a small number of
>>   core-facing BGP sessions, typically towards route reflector devices.
>>   Failure of these sessions may also result in a large impact to
>>   network operation.  Clearly, the avoidance of conditions requiring
>>   these sessions to fail is of great utility to any network operator,
>>   and provides further motivation for the revision of the existing
>>   behaviour. "
> 
> I do not understand what the assertion that you are making here is. Please 
> could you explain it to me? The errors that are being discussed relate to 
> where a subset of NLRI are advertised within an erroneous UPDATE message, and 
> the resulting impact of the current protocol behaviour on all other NLRI 
> carried on that session. The cases of IP "transit" sessions, and RR-PE 
> sessions are only examples of cases where there is a large amount of routing 
> information carried over a single session - and hence it is of utility to 
> avoid these sessions failing where they do not necessarily need to based on 
> the impact to overall network service.
> [Jim U>] The solution space here seems explicitly targeted to the internet 
> IPV4 AF. Not sure if there is a dependency here on the control/forwarding 
> planes being in parallel. I believe we need to consider all AFs that BGP is 
> used for.. As the code that would be developed would be applicable to these 
> AF also ( I presume ). So as an example would be RT-C, is this solution 
> applicable I don't know I am simply asking.. If we want to develop this and 
> have it specific to the internet use case lets state that clearly. If not, 
> then let's consider the other applications BGP supports..

[rjs] I'd say that it's not just applicable to IPv[46] in the Internet - but to 
numerous AFIs (there is a definite use-case for these solutions in L3VPN 
environments for instance). I am not saying that this is applicable or 
desirable to be turned on for all AFIs -- but it seems to me that this is a 
per-operator, per-deployment decision, not a per-AFI one. For instance, if we 
get an RTC UPDATE that is malformed, an operator may not want to tear down a 
session if it also carries other AFIs (e.g., VPNv[46] also) - in that case, the 
operator may want to treat this UPDATE as withdrawing the {as, route-target} 
NLRI (consider that we have no *standardised* multi-session mechanism yet, and 
there are potential scaling impacts of multiple sessions).
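
As a rough sketch of what "per-operator, per-deployment" could look like, the 
Python below models the choice as a local policy table. The family names, the 
two action strings and the action_for helper are hypothetical; no vendor 
configuration syntax or draft-defined knob is implied.

```python
RESET = "session-reset"
WITHDRAW = "treat-as-withdraw"


def action_for(policy, family):
    """Return the configured reaction to a malformed UPDATE for one family."""
    return policy.get(family, RESET)  # unset families keep today's behaviour


# Operator 1 accepts the inconsistency risk everywhere, including for RTC.
operator_1 = {"ipv4-unicast": WITHDRAW, "vpnv4-unicast": WITHDRAW,
              "rtc": WITHDRAW}
# Operator 2 enables it for Internet routes only.
operator_2 = {"ipv4-unicast": WITHDRAW}

for family in ("ipv4-unicast", "vpnv4-unicast", "rtc"):
    print(family, action_for(operator_1, family), action_for(operator_2, family))
```

The same code base serves both operators; the difference lies entirely in 
local policy, and anything left unconfigured falls back to session reset.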

> 
> This point (to me anyway) seems entirely related to the control-plane -- it 
> points out that an operator has cases where one really wants to keep the 
> impact of errors down to the particular subset of routing information that is 
> affected. This point is entirely in the protocol, rather than implying any 
> behaviour about forwarding (i.e., no implication is made that the NLRI 
> identified as carried in the erroneous UPDATE are installed in the FIB, but 
> rather that all *other* NLRI continue to be installed in the RIB).
> [Jim U>] See above.. My point is how can we ensure reliability across 
> AFs...There is no doubt that mal-formed updates are an issue I just do not 
> know how they affect other AFs, in those case it may be more appropriate to 
> tear down and persist instead. Can we address these other use cases?

[rjs]: The consideration (that I see) that is missing from the draft w.r.t. this 
point seems to be more "what would break if one utilises treat-as-withdraw and 
this is not {IPv[46],VPNv[46]} etc.?" Is this what you feel needs to be addressed?

[rjs]: I think the general thought process of:
        1/ In some cases, we want to avoid affecting all NLRI based on an error in 
a subset of the received NLRI.
        2/ In this case, we may need to recover from this inconsistency.
        3/ In some cases, we may want to give a session a chance to restart 
where we could not handle the error gracefully.
        4/ When errors occur on these sessions, there is great operational 
benefit of flagging them more explicitly.

[rjs]: is applicable across almost all AFIs. The distinction is really whether 
one considers that 1/ is a valid behaviour/risk for an AFI? 

> 
> I'll work through the points that you have made in this mail -- but I'd like 
> to respond with a specific point here comparing more targeted handling of 
> errors in UPDATE messages with persistence-type approaches. 
> 
> - Persistence/hold-up is not acceptable to all operators in all deployments. 
> Particularly, where no liveness detection mechanism for the next-hop is 
> available (i.e., where it is not possible or practical to determine the NH's 
> reachability in the forwarding plane other than with the routing 
> protocol itself) then one would not want to accept the risk of running 
> persistence.
> [Jim U>] Agreed.. Persistence requires that the NH be viable. 
> 
> - Therefore, an operator has two options for these deployments -- stick with 
> the error handling behaviour that is available in BGP right now, which gives 
> him no flexibility in terms of focusing responses to errors in a manner that 
> is proportional to the actual error; or alternatively, accept some additional 
> complexity and risk such that particular error handling is targeted to the 
> NLRI that are contained in an UPDATE message that is found to be erroneous.
> [Jim U>] My concern here is that the entire support structure is based on 
> session viability, the expectation here is that operators need to develop new 
> approaches to understanding if there is a routing anomaly in their topology 
> and develop tools/procedures which are different.

[rjs]: Yes, this is true. We introduce a more complex failure mode where a 
subset of prefixes are affected. *Any* new mechanism changing the tear-down and 
flush-the-RIB behaviour of BGP will need some different approaches to be 
developed. The intention of §6 and §7 of the draft is to highlight where things 
do become more complex and define requirements for ensuring that the right 
toolset exists around this behaviour.

[rjs]: Simplifying networks/removing complexity should always be a goal -- but 
we need to understand where simple behaviour does not provide the levers to 
limit the impact of errors occurring, and balance the complexity of introducing 
new mechanisms against the benefit to our network's operation. 

[rjs]: I would like to think that the document highlights where complexity is 
being introduced -- and highlights the impact of turning some of these knobs 
on. If the complexity is not acceptable within a deployment, these mechanisms 
should not be deployed. However, where it is, we don't have the knobs to do 
anything today! If the document doesn't address this to your satisfaction, 
please let me know, and I'd be happy to make sure that the complexity is 
further highlighted.

> Given there are deployments where I cannot deploy persistence (e.g., towards 
> an Internet network peer on an IXP where deployment of BFD/OAM mechanisms is 
> very limited so BGP really tells me whether the other peer is "there" at all) 
> yet in these cases I do not want to affect all NLRI when one erroneous UPDATE 
> is received then it would seem that I need answers to the requirements in 
> this draft. 
> [Jim U>] Agreed.. 
> 
> Your e-mail seems to imply that you do not have cases where you will not be 
> able to deploy persistence and that you are accepting of session-level error 
> handling in these cases. Would this be a fair assessment?
> [Jim U>] My work primarily revolves around the following BGP use cases VPN ( 
> L2/L3 ), 3107, Multi-Cast, RT-C, Flowspec not Internet.. In these 
> environments I would prefer that the session fail, persistence is activated, 
> and well known procedures are used to correct the problem. I am in no way 
> saying that the use case described here for the internet application is not 
> viable or the correct approach.. But the fact remains that the code base will 
> be used across all AFs and this solution requires different 
> OpS/Troubleshooting knowledge and support. I would prefer consistency in the 
> behavior.. 

[rjs]: I am involved in multiple network deployments across the company in 
which I work, both public and private. I understand that there are different 
requirements across different networks, and in different deployment cases. 

[rjs]: The problem with any mechanism is that it *may* be available for places 
where it is not applicable. We could analyse persistence and say that for 
Internet deployments, it has the potential to be harmful -- a device that has 
failed will continue to look like a valid path when it might not be, where a 
network in the Internet DFZ is likely to have a number of alternate paths that 
could be used. In my view, in this draft, we're just taking the "lowest common 
denominator" approach - where we say, "how can we build mechanisms that could 
be applied across all AFIs?". I think in all the use cases you mentioned, there 
are deployments where these mechanisms are applicable. In all cases, one valid 
behaviour would be that we keep the subset of NLRI that were not affected from 
a neighbour, but do not use those that are not necessarily trustworthy because 
they were linked to an invalid message in the protocol. Of course, 
operationally, you may not want to do this, based on the risk of inconsistency 
and the impact of failure. The key point here is that my network deployment may 
differ to yours, and it's up to both of us to decide for our networks, but we 
both need the tools to be able to make these choices. The problem with a 
consistent approach is that I don't see how we can have a consistent approach 
that isn't based on the lowest common denominator AFI - and even then, our 
logic may differ in terms of where we want to turn it on.

Kind regards,
r.
