Re: [GROW] WGLC: draft-ietf-grow-ops-reqs-for-bgp-error-handling-04

Shane Amante Wed, 11 Jul 2012 09:51:47 -0700

Hi Rob,

On Jul 10, 2012, at 10:27 AM, Rob Shakir wrote:
> Hi Shane,
> 
> Thanks very much for the detailed review of this document, and apologies it 
> has taken a short while for me to reply.


No problem, on both counts.  See below.


> On 25 Jun 2012, at 22:15, Shane Amante wrote:
> 
>> I've read this draft and support it being published as an Informational RFC.
>> 
>> I do have some comments, which are not blocking, but that I do hope improve 
>> the clarity & readability of the draft.
>> 
>> 1)  In Section 1.1, the following paragraph hints at the problems with 
>> current network architectures, but doesn't really come out and say what they 
>> are more directly.
>> ---snip---
>> Traditional network architectures would deploy an Interior Gateway
>> Protocol (IGP) to carry infrastructure and customer prefixes, with an
>> Exterior Gateway Protocol (EGP) such as BGP being utilised to
>> propagate these prefixes to other Autonomous Systems.  [...]
>> ---snip---
>> IMO, it would be better if the following section more directly said that 
>> scale, in terms of the sheer amount of routing information in BGP, has 
>> increased substantially, particularly in places like the core of a Route 
>> Reflection hierarchy with order _several_ million paths, and rising.  I wish 
>> I could point you at a link to a talk Danny & I gave to GROW and/or IDR at 
>> IETF 71 (?), but unfortunately I can't find a copy of the slides on the 
>> IETF's Web site.  You can see that it's in the GROW WG Agenda at IETF 71, 
>> but the slides link to a different presentation: 
>> <http://tools.ietf.org/wg/grow/agenda?item=agenda71.html>
>> 
>> Anyway, IMO, this is (from my vantage point) the single factor that makes 
>> outages not only much more visible, but /also/ (just as importantly) take 
>> longer to recover from ... (at the end of the day, CPU is finite, there are 
>> no "easy" answers wrt using multi-core processors to speed-up convergence of 
>> single AFI's, DRAM speeds are not keeping pace, etc.)  This may be useful to 
>> mention here to "set the stage", as it were, since throughout the draft 
>> there's a continuous (and, good) dialogue of not making convergence times 
>> worse or CPU load increase as a result of incorporating better recovery 
>> capabilities.  IOW, another benefit you don't really mention anywhere, but 
>> probably should, is that /theoretically/ some of these mechanisms may help 
>> further promote scalability of BGP.  Obviously, we won't know that until we 
>> are evaluating solutions and researchers have time to study them, etc.
> 
> [rjs]: I tried to add something to cover this that fits in with Section 1.1:
> 
>                        <t>
>                            The combination of the increased number of 
> deployments of BGP-4 as an intra-AS routing protocol, its use for the 
> propagation of additional types of routing and service information, and the 
> growth of IP services has resulted in a substantial increase in the volume of 
> information carried within BGP-4. In numerous networks, RIB sizes of the 
> order of millions of entries exist, with particular high-scale points 
> existing at BGP speakers performing aggregation or functionality designed 
> improve utilisation of network resources (e.g., route reflector hierarchies). 
> Whilst clearly an increase in the amount routing information carried in BGP 
> results in greater impact to services during failures, it is also critical to 
> their recovery time. The increased time to compute new paths following a 
> failures and subsequently re-learn them following recoveries results in 
> greater impact of failures within the protocol, and hence adds further weight 
> to the requirement to 
 avoid failures affecting all routing, or service, information carried via a 
particular adjacency. Whilst an argument could be made the convergence time of 
BGP-4 can be reduced through additional computational resource being deployed, 
it is notable that significant challenges continue to exist for operators of 
scaling BGP-4, and hence mechanisms which improve the scalability of the 
protocol are of particular note.
>                        </t>


The above looks good, but I've made some minor modifications.  See below.
---snip---
The combination of the increased number of deployments of BGP-4 as an intra-AS 
routing protocol, its use for the propagation of additional types of routing 
and service information, and the growth of IP services has resulted in a 
substantial increase in the volume of information carried within BGP-4. In 
numerous networks, RIB sizes of the order of millions of entries exist within 
individual BGP speakers, with particularly high-scale points exhibited at BGP 
speakers performing aggregation or functionality designed improve utilisation 
of network resources (e.g., route reflector hierarchies). Whilst clearly an 
increase in the amount routing information carried in BGP results in greater 
impact to services during failures, which is only amplified by a corresponding 
increase in recovery times. Following a failure, there is a substantial 
recovery time to learn, compute and distribute new paths, which results in a 
greater observed impact to services affected, and hence adds further we
 ight to the requirement to avoid failures altogether or, at least, mitigate 
their impact to the narrowest scope possible, (e.g.: a specific NLRI). Whilst 
an argument could be made that convergence time of BGP-4 could potentially be 
reduced through deployment of additional computational resource, it is notable 
that solution is not necessarily straightforward from an implementation or 
deployment point-of-view, (e.g.: scaling computation resources within a single 
address-family is difficult).  Thus, significant challenges continue to exist 
for operators when scaling BGP-4 deployments, and hence mechanisms which 
improve the scalability of BGP-4 are very important.
---snip---


> [rjs]: Do you think this captures the point that was missing? I tried to 
> avoid making any point that asserts that the requirements defined are going 
> to make BGP scale better, but rather just call out that operationally, 
> throwing resources at the problem isn't something that's always possible, so 
> anything that reduces the amount of "work" BGP needs to do is interesting.

I definitely agree.


>> Related to the above, there is also the following paragraph:
>> ---snip---
>> Along with this change in role, the nature of the IP routing
>> information that is carried has changed.  BGP has become a ubiquitous
>> means by which service information can be propagated between devices.
>> ---snip---
>> ... it may be better to say that the original scope of BGP wrt *just* 
>> exchanging reachability information has expanded to include functions that 
>> *were* typically in the domain of 'traditional' provisioning activities, and 
>> leave it at that.  (You are obviously familiar with things like BGP A-D + 
>> VPLS-LDP vs. VPLS-BGP vs. RFC 4364 ... where the former divorces service 
>> discovery [provisioning] from signaling/network reachability, whereas the 
>> latter two tightly couple service discovery & network reachability).
> 
> [rjs]: I'm not quite clear on whether this gets the point across completely - 
> do we think that it is just that things have become in the realm of 
> provisioning activities, or rather is it that there are more and more 
> functions that are overloading onto BGP. I agree that this sentence doesn't 
> necessarily capture that - but do you think that it's the generic information 
> transfer protocol between PEs, as well as replacing provisioning mechanisms?

I believe that you are correct, and better off, in stating "more and more 
functions that are overloaded (sic) onto BGP".  Although, I'm not sure that 
"overloaded" is an appropriate adjective.  

The point I was trying to get at is as follows.  I think there's a continuum of 
information exchanged within BGP from real-time information (reachability) to 
less dynamic (perhaps, even static) information, with _examples_ of the latter 
being auto-discovery/provisioning use cases.  While traditional applications, 
such as vanilla Internet service for which BGP was originally designed, only 
fall into the "real-time information" category ... there are a lot of new(er) 
applications that do not fit "neatly" in a single category and, in fact, span 
the range of real-time to less dynamic categories depending on which facet of a 
particular protocol you look at, (examples being: IPVPN, MVPN, VPLS-BGP, etc.). 
 Regardless, I don't think it's prudent to make value judgements (particularly 
at this point in time when these protocols are already widely deployed and 
successful) as to the "correctness" of these functions/services being in BGP, 
since that's bound to be very subjective.  Rather,
  we need to recognize the world for what it is today, which is why I think use 
of the word "overloaded" may be inappropriate.  Furthermore, I think that 
talking about this in such a context is only recognizing a symptom (the more 
complex the system, the higher the probability is to introduce errors), when in 
reality we should be trying to focus in on the root problem: since we've put so 
many eggs in one basket, we need unnoticeable (or, faster) recovery from errors 
that affect real-time, reachability information.



>> 2)  In Section 3, wrt the discussion on draft-ietf-idr-optional-transitive, 
>> it may be much more compelling to state that practically this solution has 
>> (currently) only been defined to be applicable to 3 out of the 29 total BGP 
>> path attributes that have been assigned by IANA, at this time.  And, even if 
>> you *could* to go farther, then only 10 out of the total 29 BGP path 
>> attributes are optional transitive.  So, the short story is: a lot more work 
>> in the area of session preservation
>> of iBGP sessions remains to be done, in particular wrt the other types of 
>> mandatory, discretionary, etc. attributes ... (assuming that such a thing is 
>> even technically possible with BGPv4, as we know it).
> 
> [rjs]: Agreed, it is definitely worth highlighting this. I amended the 
> paragraph mentioned to cover this:
> 
>                    <t>
>                        At the time of writing, error handling mechanisms 
> related to optional, transitive attributes - such as <xref 
> target='I-D.ietf-idr-optional-transitive' /> are restricted to handling only 
> a subset of attribute errors - whereas the operational requirement is to 
> expand this coverage to the widest set of errors possible (i.e., all semantic 
> errors within UPDATE messages). Additionally, where approaches applicable to 
> a greater number of attributes are proposed (e.g., <xref 
> target='I-D.chen-ebgp-error-handing'>), these are limited to deployment in 
> eBGP applications only, where requirements also exist in intra-domain cases. 
> As such, it is envisaged that if extended to cover these expanded cases, 
> these mechanisms provide a means to avoid the transmission of a NOTIFICATION 
> message to a remote BGP speaker, based on a single erroneous message, where 
> at all possible, and hence meet this requirement. Critical errors, including 
> those whereby the NLRI cannot be extrac
 ted from the UPDATE message, represent cases whereby the receiving system 
cannot handle the error gracefully based on this mechanism.
>                    </t>

OK.


>> [ snipped ]
>> 
>> 4)  In Section 4, I can't really discern a strong (?) preference for one of 
>> the associated recommendations to recovering RIB consistency.  I *think* 
>> what you should say, more clearly, is there is a strong preference toward 
>> narrow, targeted refresh mechanisms covering specific routes over those 
>> broader solutions (recovering an entire Adj-RIB-In), to reduce the
>> impact on CPU, DRAM, etc.  Perhaps you could add a sentence at the end of 
>> the 4th paragraph, where you discuss draft-zeng-one-time-prefix-orf, along 
>> the lines of what I just mentioned above?  The good news is you state this 
>> very clearly, but much later, in Section 7.  Perhaps you can pull this 
>> sentence from Section 7 up to Section 4, as well?
>> ---snip---
>> [...] It is recommended that where available, any automatic (or
>> manual) triggered recovery mechanism behaviour utilises such targeted
>> means in preference to any whole RIB refresh mechanism (such as
>> ROUTE-REFRESH).
>> ---snip--- 
> 
> [rjs]: Agreed, yes. I think that there is a strong preference for the 
> targeted mechanism in terms of scale. I rephrased the Section 3 text to be 
> stronger, and left in the Section 7 text:
> 
>       <t>
>               Whilst re-advertisement of the whole BGP RIB provides a means 
> by which withdrawn NLRI can be re-advertised, there are some scaling 
> implications that must be considered. In the case that a ROUTE-REFRESH is 
> generated, all NLRI must be re-packed into UPDATE messages and advertised by 
> one speaker on the BGP session, whilst the other must receive all UPDATE 
> messages, and validate the RIB's consistency. In order to avoid the 
> control-plane load, it is therefore a requirement to utilise targeted 
> mechanisms where possible, rather than incurring the additional load on both 
> the advertising and receiving speaker of building and processing UPDATEs for 
> the entire contents of the RIB.
>       </t>

OK.


>> 5)  Section 5 seems targeted solely at recreating the data structures at one 
>> side of a BGP session, (specifically a receiver of a BGP session).  However, 
>> who's to say the corruption is not actually in the Adj-RIB-Out of the BGP 
>> transmitter?  Thus, transmitting you an Adj-RIB-Out all over again, even 
>> "gracefully", is unlikely to do much good.  Perhaps it would be good to 
>> consider whether the "Receiving Speaker", in Graceful Restart terminology, 
>> should consider 'gracefully' purging and recreating his Adj-RIB-Out data 
>> structures, as well, before re-xmit'ing them to the "Restarting Speaker"?
> 
> [rjs]: Absolutely, this definitely wasn't handled in the draft -- thanks for 
> pointing it out. I added the following text - do you think this is sufficient?
> 
>                    <t>
>                        In some cases, the erroneous condition may be due to 
> corruption of the Adj-RIB-Out on the advertising BGP speaker - rather than 
> caused by the receiving speaker's state. In these cases, where existing 
> structures are replayed whilst performing graceful restart functionality, the 
> error condition is not necessarily resolved. Therefore, it is recommended 
> that during a session restart event, as described within this section, the 
> advertising speaker purge and rebuild RIB structures, in order to resolve any 
> corruption within these structures.
>                    </t>

OK.


>> Nits:
>> 1)  From <http://tools.ietf.org/html/rfc4271#section-3.1>:
>> ---snip---
>> For the purpose of this protocol, a route is defined as a unit of
>> information that pairs a set of destinations with the attributes of a
>> path to those destinations.  The set of destinations are systems
>> whose IP addresses are contained in one IP address prefix that is
>> carried in the Network Layer Reachability Information (NLRI) field of
>> an UPDATE message, and the path is the information reported in the
>> path attributes field of the same UPDATE message.
>> ---snip---
>> ... thus, to be more correct, I think you should scrub the document to say 
>> "route" in places where you're saying "prefix".  Some examples include:
>> 
>> http://tools.ietf.org/html/draft-ietf-grow-ops-reqs-for-bgp-error-handling-04#section-1.2
>> ---snip---
>> o  It is unacceptable within modern deployments of the BGP-4 protocol
>>  that a single erroneous UPDATE packet affects prefixes that it
>>  does not carry.  This requirement therefore requires some
>> ---snip---
>> s/affects prefixes/affects routes/
>> 
>> ---snip---
>> this reset of the BGP-4 session results in interruption to
>> forwarding packets (by means of withdrawing prefixes installed by
>> BGP-4 into a device's RIB, and subsequently FIB).  To this end,
>> ---snip---
>> s/withdrawing prefixes/withdrawing routes/
> 
> [rjs]: Thanks, I've done this to reflect this definition.
> 
>> 
>> 2)  In Section 2, this may be just wording, but I'm unclear on what you mean 
>> in the below where you say: "... handle the error in a manner focused on the 
>> NLRI contained within the message".  Are you trying to say something like 
>> "constrain the logging and/or ignoring of malformed UPDATE messages specific 
>> to the narrowest subset of routes for which malformed path attributes have 
>> been received?"  I'm not suggesting my wording is necessarily better, in 
>> this case, but perhaps you could look at being a bit more precise below?
>> ---snip---
>> Where an UPDATE message is considered invalid by a BGP speaker due to
>> an error within a path attribute that is not the NLRI (where the
>> definition of NLRI includes reachability information encoded in the
>> MP_REACH_NLRI and MP_UNREACH_NLRI attributes as specified in
>> [RFC4760]) it is a requirement of any enhanced error handling
>> mechanism to handle the error in a manner focused on the NLRI
>> contained within the message.  
>> ---snip---
> 
> [rjs]: Yes - the intention is to define this based on the narrowest set 
> possible, the reason that I used this wording is that (in my view) this is 
> defined by the NLRI actually in the message (if there were differing path 
> attributes for NLRI, then we expect that this is packed into a second UPDATE 
> message). Perhaps a hybrid of our wording would clarify this (unless you 
> think the assertion above is erroneous?).

I see your point now.  How about the following hybrid text?
---snip---
... it is a requirement of any enhanced error handling mechanism to constrain 
the error handling so that it is narrowly focused on the NLRI contained within 
the bad UPDATE message.
---snip---
 

>> 3)  Section 2:
>> ---snip---
>> contained within the message.  Since in this case, the message
>> received from the remote peer is syntactically valid, it is
>> considered that such an UPDATE is indicative of erroneous data within
>> a path attribute.  [...]
>> ---snip---
>> s/path attribute/path attributes/
> 
> [rjs]: Is the point here "one or more path attributes"? I'm not sure I quite 
> understand the nit? :-)

Yes, sorry: "one or more path attributes".  (My point was you can't predict, 
here anyway, that it will only a single path attribute that is a problem.  
Ideally, a more robust error-handling solution would not make such assumptions 
:-).


>> 4)  In Section 4, there do you really mean "transitive" here or do you 
>> instead mean "transient"?
>> ---snip---
>> It is of particular note for both means of recovering RIB consistency
>> described that these are effective only when considering transitive
>> errors within an implementation - for instance, should an RFC ...
>> ---snip---
> 
> [rjs]: Transient, thanks.
> 
>> ... also, in the following, do you mean "dynamic" instead of "transitive" in 
>> the below?
>> ---snip---
>> [...] It is not advisable that a transitive
>> filter and advertisement mechanism is triggered by all error handling
>> events due to the load this is likely to place on the neighbour
>> receiving such a request.  Where this BGP speaker is a relatively
>> ---snip---
> 
> [rjs]: Dynamic is definitely what is meant here.
> 
>> 
>> 5)  In Section 7.1, I think you probably want to s/NLRI/route/ in several 
>> places, e.g.:
>> ---snip---
>> ... upstream speaker to perform a best-path selection, and re-advertise a
>> new set of __NLRI__ before the downstream system is able to converge to a
>> new path.
>> ---snip---
> 
> [rjs]: Thanks, nit corrected.
> 
> Many thanks again for your comments - if you could cast your eyes over the 
> above corrections, and let me know if you feel they're sufficient, that'd be 
> fantastic.

And, thank you Rob for your excellent work on this.

-shane
_______________________________________________
GROW mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/grow

Re: [GROW] WGLC: draft-ietf-grow-ops-reqs-for-bgp-error-handling-04

Reply via email to