[GROW] draft-ietf-grow-ops-reqs-for-bgp-error-handling-04

UTTARO, JAMES Fri, 22 Jun 2012 08:22:17 -0700

Rob,

Please find my comments below.


Thanks,
                Jim Uttaro

General Comment,

From a philosophical perspective I agree with the goals of this draft, but I do 
not agree with an approach that maintains a session in the face of a failure in 
the machinery. This is a bottom-up approach which will always be a day late and 
a dollar short as we continue to patch what it means to be a viable session. I 
believe the proper approach to meet your reqs (and mine too!) is a top-down one 
which does not change the BGP behavior but changes the response to that 
behavior. The reality of today's BGP fields of use, which are many and varied, 
is that the control and forwarding paths are orthogonal and the state being 
carried is not always paths as we usually consider them. Examples include RT-C, 
Flowspec, which are used to create PWs and simulate an IGP, etc. I do not 
believe that changing the behavior of the BGP session viability machinery to 
maintain valid forwarding in the face of a control plane failure is the correct 
approach.

The draft addresses a very important error condition (Update Error), but this 
should be expanded to other areas where the failure of a session does not 
indicate that a subset or all of the routing state learned over said session is 
invalid. I think the draft should address this directly. Are the changes here 
configurable? Can I turn this off if I do not want this behavior for certain 
topologies or AFs?

Abstract

Can the scope be expanded? There are other failure modes, e.g. Timer Expiry, 
which today is not considered a failure mode. In reality, what I have seen is 
that timer expiry occurs because BGP threads cannot be serviced in a timely 
manner. I think it would be best if we could put it all on the table.

I do think that this draft should bound the solution space. At a minimum, the 
solutions proposed should meet a baseline set of the operators' criteria in 
terms of managing the network, convergence, persistence, churn, forwarding 
impact, etc.

Section 1.1

The following paragraph is based on the premise that the session being "down" 
results in a large impact. This is certainly true for today's implementations, 
which use the session (a control plane construct) to determine the viability of 
the forwarding state learned over said session. There are cases where the 
session and forwarding are parallel, but in many more cases the control and 
forwarding planes are orthogonal. I think we need to reconsider this assumption 
for many of the services BGP is being used for, and base the response to error 
conditions on this reality.

" Both within Internet and multi-service routing architectures, a
   number of BGP sessions propagate a large proportion of the required
   routing information for network operation.  For Internet routing,
   these are typically BGP sessions which propagate the global routing
   table to an AS - failure of these sessions may have a large impact on
   network service, based on a single erroneous update.  In an multi-
   service environment, typical deployments utilise a small number of
   core-facing BGP sessions, typically towards route reflector devices.
   Failure of these sessions may also result in a large impact to
   network operation.  Clearly, the avoidance of conditions requiring
   these sessions to fail is of great utility to any network operator,
   and provides further motivation for the revision of the existing
   behaviour. "

Section 1.2

Bullet 1

This is a very interesting point. As you stated:

"  Traditional network architectures would deploy an Interior Gateway
   Protocol (IGP) to carry infrastructure and customer prefixes, with an
   Exterior Gateway Protocol (EGP) such as BGP being utilised to
   propagate these prefixes to other Autonomous Systems. "



In this environment, where BGP was predominantly used to advertise state 
between AS domains over dedicated peering points, it would make sense that a 
malformed update learned from a peer is a fairly good indication that the peer 
originating the update is suspect. As there are other NHs available, it would 
be prudent not to use the suspect session/forwarding path, which in these cases 
were in parallel. Maybe "treat as withdraw" is appropriate, I am not sure, 
although the SP should be able to decide the course of action.



I would ask you to consider that there are two cases here. The first is when a 
speaker learns an update from a peer where the NH for the paths in that update 
is the direct peer (ASBR, PE), and the second is where the update is from a 
peer where the NH for the paths in the update is not the peer (RR). In the 
former case, is there still a case that the original premise holds, i.e. that 
the offending egress router should be disconnected from the topology? I don't 
think this is black and white, and operators may want the flexibility of 
determining the behavior.
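
As a rough illustration of the two cases above, the sketch below classifies a 
malformed update by whether its NH is the sending peer and looks up an 
operator-configured response. Everything in it is an assumption made for 
illustration and is not taken from the draft: the names MalformedUpdate, 
classify and choose_action are hypothetical, the default policy values are 
arbitrary placeholders, and comparing the NH to the peer address is a 
simplification of how a real speaker would decide that the peer is the egress.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESET_SESSION = "reset-session"
    TREAT_AS_WITHDRAW = "treat-as-withdraw"


@dataclass(frozen=True)
class MalformedUpdate:
    peer_address: str  # address of the peer the UPDATE was received from
    next_hop: str      # NH carried for the paths in that UPDATE


# Hypothetical operator policy, one entry per case. The values chosen
# here are placeholders, not a recommendation.
DEFAULT_POLICY = {
    "next_hop_is_peer": Action.RESET_SESSION,        # e.g. ASBR, PE
    "next_hop_is_remote": Action.TREAT_AS_WITHDRAW,  # e.g. learned via an RR
}


def classify(update):
    """Return which of the two cases a malformed update falls into."""
    if update.next_hop == update.peer_address:
        return "next_hop_is_peer"
    return "next_hop_is_remote"


def choose_action(update, policy=DEFAULT_POLICY):
    """Look up the operator-configured response for this update's case."""
    return policy[classify(update)]
```

An operator who wanted different behavior for either case would pass their own 
policy mapping rather than relying on the placeholder defaults.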



Bullet 2



Totally agree. This is one of the major requirements driving the BGP 
Persistence draft.



Bullet 3



I am not sure exactly what the intent is here. I am all for more robust NM and 
visibility.



Section 2



I think I understand where you are coming from for the first set of errors. Do 
we have a feel as to why there would be erroneous data? It would seem that 
whichever speaker created, modified, etc. the attr is experiencing some 
fundamental issue with the BGP machinery, as it is not validating the context.

"Since in this case, the message
   received from the remote peer is syntactically valid, it is
   considered that such an UPDATE is indicative of erroneous data within
   a path attribute."



Section 2.1, 2.1.1, 2.1.2



Hmmm. To be honest, this makes me a bit nervous. This introduces classes of 
failure modes. Downstream NM, Ops, etc. will have to learn how to support these 
nuanced messages and their intended meanings. One thing is certain: when a BGP 
session goes down in today's world it merits the immediate attention of 
operations to figure out what went wrong and to fix it. Solutions like GR or 
BGP Persistence do not change the trigger for the operations response.



Section 3



The premise is to modify when a NOTIFICATION message is sent, thus mitigating 
the possibility that the session becomes invalid, or to use "treat as withdraw" 
for those paths sharing this common attribute. The issue in my mind is the 
notion of the session as the be-all and end-all. We could as easily, without 
any change to BGP, use BGP Persistence to maintain the paths except for the 
ones that have the invalid attribute. This is the simpler method, and has the 
benefit of not changing BGP or having to educate the world on the nuances of 
the changes.
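
The alternative described above (let the session fail, keep the paths learned 
over it, and drop only those carrying the invalid attribute) can be sketched as 
a simple filter. This is an illustration under stated assumptions only: the 
function name, the dict-based Adj-RIB-In and the "stale" flag are hypothetical 
and are not taken from the BGP Persistence draft or from any implementation.

```python
def retain_on_session_failure(adj_rib_in, invalid_prefixes):
    """Return the paths to keep after the session to a peer has failed.

    adj_rib_in       -- hypothetical mapping of prefix -> path attributes
                        learned from the failed peer
    invalid_prefixes -- prefixes whose paths carried the invalid attribute

    Paths with the invalid attribute are dropped; all others are kept and
    flagged as stale so they remain usable for forwarding. The input
    mapping is not modified.
    """
    return {
        prefix: {**attrs, "stale": True}
        for prefix, attrs in adj_rib_in.items()
        if prefix not in invalid_prefixes
    }
```

The session itself still goes down in this sketch, so the existing operational 
trigger (a failed session) is unchanged.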



I also do not fully understand "treat as withdraw". Does this mean that the 
peer which has received an update for P1-PN with a malformed attr then 
initiates a withdrawal to all of its peers? Or does it simply assume that the 
paths have been received as a message? Some sample topologies showing how this 
works would be a good addition to this section.



Section 4



I made some comments on a solution, "Re: [Idr] I-D Action: 
draft-ietf-idr-error-handling-02.txt", which I think is intended to address 
some aspects of this reqs doc. It seems that forwarding loops can be created 
and that BGP may be unable to recover without operator intervention.

" There are therefore risks of traffic blackholing, due to
   missing routing information, or forwarding loops.  Whilst this is
   deemed an acceptable compromise in the short term, clearly, it is
   suboptimal.  Therefore, a requirement exists to provide mechanisms by
   which a BGP speaker is able to recover the consistency of the Adj-
   RIB-In for a particular neighbour."



I do not think that the above draft, which does not recover, addresses this 
req.

I am not in support of solutions which create a scenario where BGP cannot 
recover without human intervention.

"It is of particular note for both means of recovering RIB consistency
   described that these are effective only when considering transitive
   errors within an implementation - for instance, should an RFC
   interpretation error within an implementation be present, regardless
   of the number of times a specific UPDATE is generated, it is likely
   that this error condition will persist (as it may with the existing
   behaviour defined by [RFC4271])."



I think the difference is that the operator knows that a serious situation is 
occurring by virtue of the session failing and can take appropriate action to 
correct it. Forwarding can be maintained using Persistence or GR. So IMO I can 
know a serious issue exists using existing BGP machinery and well-known 
procedures, and at the same time maintain my forwarding.



Section 5



Why wouldn't we simply let the session fail and then use BGP Persistence or 
GR? ;)



Section 6



Nothing is going to get people's attention like a failed BGP session. I can 
only speak for my org, but this is what is actively monitored as the basis for 
the health of BGP. Diagnostic, log, etc. messages are of high volume and do not 
get folks' attention. IMO this is a major disadvantage of this approach. We do 
not know how to convey that something bad has happened to the BGP machinery.



Section 7



Although I understand the distinction between the different classes of Update 
Errors, I am not sure that I understand how that translates into how the BGP 
machinery is actually being impacted. The increased complexity of 
"understanding" the type of update error message, its cause, possible 
remediation, etc. makes this approach really difficult to wrap one's arms 
around.

