I've read this draft and support it being published as an Informational RFC.

I do have some comments, which are not blocking, but which I hope will improve 
the clarity & readability of the draft.

1)  In Section 1.1, the following paragraph hints at the problems with current 
network architectures, but doesn't come out and say directly what they are.
---snip---
Traditional network architectures would deploy an Interior Gateway
Protocol (IGP) to carry infrastructure and customer prefixes, with an
Exterior Gateway Protocol (EGP) such as BGP being utilised to
propagate these prefixes to other Autonomous Systems.  [...]
---snip---
IMO, it would be better if the following section more directly said that scale, 
in terms of the sheer amount of routing information in BGP, has increased 
substantially, particularly in places like the core of a Route Reflection 
hierarchy, on the order of _several_ million paths and rising.  I wish I could point 
you at a link to a talk Danny & I gave to GROW and/or IDR at IETF 71 (?), but 
unfortunately I can't find a copy of the slides on the IETF's Web site.  You 
can see that it's in the GROW WG Agenda at IETF 71, but the slides link to a 
different presentation: 
<http://tools.ietf.org/wg/grow/agenda?item=agenda71.html>

Anyway, IMO, this is (from my vantage point) the single factor that makes 
outages not only much more visible, but /also/ (just as importantly) take 
longer to recover from ... (at the end of the day, CPU is finite, there are no 
"easy" answers wrt using multi-core processors to speed-up convergence of 
single AFIs, DRAM speeds are not keeping pace, etc.)  This may be useful to 
mention here to "set the stage", as it were, since throughout the draft there's 
a continuous (and, good) dialogue of not making convergence times worse or CPU 
load increase as a result of incorporating better recovery capabilities.  IOW, 
another benefit you don't really mention anywhere, but probably should, is that 
/theoretically/ some of these mechanisms may help further promote scalability 
of BGP.  Obviously, we won't know that until we are evaluating solutions and 
researchers have time to study them, etc.

Related to the above, there is also the following paragraph:
---snip---
Along with this change in role, the nature of the IP routing
information that is carried has changed.  BGP has become a ubiquitous
means by which service information can be propagated between devices.
---snip---
... it may be better to say that the original scope of BGP wrt *just* 
exchanging reachability information has expanded to include functions that 
*were* typically in the domain of 'traditional' provisioning activities, and 
leave it at that.  (You are obviously familiar with things like BGP A-D + 
VPLS-LDP vs. VPLS-BGP vs. RFC 4364 ... where the former divorces service 
discovery [provisioning] from signaling/network reachability, whereas the 
latter two tightly couple service discovery & network reachability).

2)  In Section 3, wrt the discussion on draft-ietf-idr-optional-transitive, it 
may be much more compelling to state that, in practice, this solution has 
(currently) only been defined to be applicable to 3 out of the 29 total BGP 
path attributes that have been assigned by IANA.  And, even if you *could* go 
farther, only 10 out of the total 29 BGP path attributes are optional 
transitive.  So, the short story is: a lot more work in the area of 
preservation of iBGP sessions remains to be done, in particular wrt the other 
types of mandatory, discretionary, etc. attributes ... (assuming that such a 
thing is even technically possible with BGPv4, as we know it).

BTW, here's my list of optional, transitive attributes, with an "*" indicating 
those for which draft-ietf-idr-optional-transitive is applicable, in case 
you're wondering where I came up with my numbers above.  You don't need to 
include this in your draft.  I'm just providing it here in case someone wants 
to double-check my sources.
---snip---
7: Aggregator*
8: Community*
11: DPA - Destination Preference Attr
16: Extended Community*
17: AS4_PATH
18: AS4_Aggregator
22: PMSI-tunnel
23: Tunnel Encapsulation Attribute
25: IPv6 Address Specific Extended Community
128: Attribute Set
---snip---
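
For anyone double-checking the arithmetic, here is a minimal Python sketch that tallies the list above.  The attribute codes and names are copied from that list; the flag constants follow the Attribute Flags octet layout in RFC 4271 (high-order bit Optional, next bit Transitive).  The total of 29 is simply the figure quoted in this review, not something the sketch derives, and the helper and variable names are hypothetical.

```python
# Attribute Flags octet bits, per RFC 4271 Section 4.3.
OPTIONAL = 0x80
TRANSITIVE = 0x40

# Optional transitive attributes from the list above.  The boolean is True
# for entries marked "*" (covered by draft-ietf-idr-optional-transitive).
OPTIONAL_TRANSITIVE = {
    7: ("Aggregator", True),
    8: ("Community", True),
    11: ("DPA", False),
    16: ("Extended Community", True),
    17: ("AS4_PATH", False),
    18: ("AS4_Aggregator", False),
    22: ("PMSI-tunnel", False),
    23: ("Tunnel Encapsulation Attribute", False),
    25: ("IPv6 Address Specific Extended Community", False),
    128: ("Attribute Set", False),
}

TOTAL_ASSIGNED = 29  # figure quoted in this review, not derived here


def is_optional_transitive(flags: int) -> bool:
    """Return True if both the Optional and Transitive bits are set."""
    mask = OPTIONAL | TRANSITIVE
    return (flags & mask) == mask


# Attribute codes that the draft's mechanism is defined for.
covered = sorted(
    code for code, (_, starred) in OPTIONAL_TRANSITIVE.items() if starred
)

print(f"{len(covered)} of {TOTAL_ASSIGNED} covered by the draft: {covered}")
print(f"{len(OPTIONAL_TRANSITIVE)} of {TOTAL_ASSIGNED} are optional transitive")
```

The flag check is included only to make "optional transitive" concrete; it plays no part in the counts.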

4)  In Section 4, I can't really discern a strong (?) preference for one of the 
associated recommendations for recovering RIB consistency.  I *think* what you 
should say, more clearly, is that there is a strong preference toward narrow, 
targeted refresh mechanisms covering specific routes over broader solutions 
(recovering an entire Adj-RIB-In), to reduce the impact on CPU, DRAM, etc.  
Perhaps you could add a sentence at the end of the 
4th paragraph, where you discuss draft-zeng-one-time-prefix-orf, along the 
lines of what I just mentioned above?  The good news is you state this very 
clearly, but much later, in Section 7.  Perhaps you can pull this sentence from 
Section 7 up to Section 4, as well?
---snip---
[...] It is recommended that where available, any automatic (or
manual) triggered recovery mechanism behaviour utilises such targeted
means in preference to any whole RIB refresh mechanism (such as
ROUTE-REFRESH).
---snip--- 
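
The preference in the recommendation quoted above can be sketched as a simple selection rule.  This is only an illustration under assumptions: the function name, the mechanism labels, and the idea that the speaker already holds a list of suspect prefixes are all hypothetical, and nothing here corresponds to a real implementation's API.

```python
from typing import Iterable, List, Tuple


def choose_recovery(suspect_prefixes: Iterable[str],
                    targeted_available: bool) -> Tuple[str, List[str]]:
    """Pick a recovery mechanism, preferring the narrowest one available.

    Returns a (mechanism, prefixes) pair.  An empty prefix list together
    with "whole-rib-refresh" means the entire Adj-RIB-In is re-requested.
    """
    prefixes = sorted(set(suspect_prefixes))
    if targeted_available and prefixes:
        # Re-request only the routes believed to be inconsistent.
        return ("targeted-refresh", prefixes)
    # Fall back to re-requesting everything (for example, ROUTE-REFRESH).
    return ("whole-rib-refresh", [])


print(choose_recovery(["192.0.2.0/24", "198.51.100.0/24"], True))
print(choose_recovery(["192.0.2.0/24"], False))
```

The point of the rule is that the cost to the neighbour scales with the number of suspect routes rather than with the size of the whole table.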

5)  Section 5 seems targeted solely at recreating the data structures on one 
side of a BGP session (specifically, the receiving side).  However, who's to 
say the corruption is not actually in the Adj-RIB-Out of the BGP transmitter?  
In that case, transmitting the same Adj-RIB-Out all over again, even 
"gracefully", is unlikely to do much good.  Perhaps it would be good to 
consider whether the "Receiving Speaker", in Graceful Restart terminology, 
should 'gracefully' purge and recreate its Adj-RIB-Out data structures, as 
well, before re-xmit'ing them to the "Restarting Speaker"?
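
To make the suggestion concrete, here is a rough Python sketch of a sender rebuilding its Adj-RIB-Out from the Loc-RIB instead of replaying the existing (possibly corrupt) copy.  Everything in it is an assumption for illustration: the dictionary-based RIB shape, the `export_policy` callback, and the function name are hypothetical and do not describe any real BGP implementation.

```python
from typing import Callable, Dict, Optional

# Hypothetical shape: path attributes keyed by attribute name.
Route = Dict[str, object]


def rebuild_adj_rib_out(
    loc_rib: Dict[str, Route],
    export_policy: Callable[[str, Route], Optional[Route]],
) -> Dict[str, Route]:
    """Recreate the Adj-RIB-Out from the Loc-RIB, ignoring the old copy."""
    fresh: Dict[str, Route] = {}
    for prefix, route in loc_rib.items():
        exported = export_policy(prefix, route)
        if exported is not None:
            # Copy the attributes so no state is shared with the old structures.
            fresh[prefix] = dict(exported)
    return fresh


loc_rib = {
    "192.0.2.0/24": {"AS_PATH": "64500"},
    "198.51.100.0/24": {"AS_PATH": "64501"},
}


def export_policy(prefix: str, route: Route) -> Optional[Route]:
    """Example policy: advertise everything except 198.51.100.0/24."""
    return None if prefix == "198.51.100.0/24" else route


adj_rib_out = rebuild_adj_rib_out(loc_rib, export_policy)
print(adj_rib_out)
```

Whether a purge-and-rebuild of this kind is affordable on a large speaker is exactly the sort of CPU question raised earlier in this review.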




Nits:
1)  From <http://tools.ietf.org/html/rfc4271#section-3.1>:
---snip---
For the purpose of this protocol, a route is defined as a unit of
information that pairs a set of destinations with the attributes of a
path to those destinations.  The set of destinations are systems
whose IP addresses are contained in one IP address prefix that is
carried in the Network Layer Reachability Information (NLRI) field of
an UPDATE message, and the path is the information reported in the
path attributes field of the same UPDATE message.
---snip---
... thus, to be more correct, I think you should scrub the document to say 
"route" in places where you're saying "prefix".  Some examples include:

http://tools.ietf.org/html/draft-ietf-grow-ops-reqs-for-bgp-error-handling-04#section-1.2
---snip---
o  It is unacceptable within modern deployments of the BGP-4 protocol
   that a single erroneous UPDATE packet affects prefixes that it
   does not carry.  This requirement therefore requires some
---snip---
s/affects prefixes/affects routes/

---snip---
this reset of the BGP-4 session results in interruption to
forwarding packets (by means of withdrawing prefixes installed by
BGP-4 into a device's RIB, and subsequently FIB).  To this end,
---snip---
s/withdrawing prefixes/withdrawing routes/
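
To illustrate why the distinction matters, here is a small Python sketch of the RFC 4271 definition quoted above: a route pairs a prefix with the path attributes for reaching it, so two routes can share one prefix and still be different routes.  The class, field names, and example values (documentation prefixes and private-use AS numbers) are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Route:
    """A route in the RFC 4271 sense: a prefix plus its path attributes."""
    prefix: str
    path_attributes: Tuple[Tuple[str, str], ...]


# Two routes for the same prefix, learned with different path attributes.
via_a = Route("192.0.2.0/24",
              (("AS_PATH", "64500 64501"), ("NEXT_HOP", "198.51.100.1")))
via_b = Route("192.0.2.0/24",
              (("AS_PATH", "64502"), ("NEXT_HOP", "198.51.100.2")))

print(via_a.prefix == via_b.prefix)  # same prefix
print(via_a == via_b)                # different routes
```

Withdrawing "the prefix" is ambiguous between these two; withdrawing "the route" is not.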

2)  In Section 2, this may be just wording, but I'm unclear on what you mean in 
the below where you say: "... handle the error in a manner focused on the NLRI 
contained within the message".  Are you trying to say something like "constrain 
the logging and/or ignoring of malformed UPDATE messages specific to the 
narrowest subset of routes for which malformed path attributes have been 
received?"  I'm not suggesting my wording is necessarily better, in this case, 
but perhaps you could look at being a bit more precise below?
---snip---
Where an UPDATE message is considered invalid by a BGP speaker due to
an error within a path attribute that is not the NLRI (where the
definition of NLRI includes reachability information encoded in the
MP_REACH_NLRI and MP_UNREACH_NLRI attributes as specified in
[RFC4760]) it is a requirement of any enhanced error handling
mechanism to handle the error in a manner focused on the NLRI
contained within the message.  
---snip---
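
One possible reading of "focused on the NLRI contained within the message" is sketched below in Python: only routes whose NLRI appear in the invalid UPDATE are touched, and every other route on the session is left alone.  This is an interpretation, not the draft's specified behaviour.  The action taken here (dropping the affected routes and returning them for logging), the dictionary-based Adj-RIB-In, and the function name are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def handle_invalid_update(adj_rib_in: Dict[str, dict],
                          nlri_in_message: Iterable[str]) -> List[str]:
    """Limit error handling to the routes carried in the invalid UPDATE.

    Returns the prefixes that were removed so the caller can log them.
    Dropping the routes is an assumed action; the point is the scope.
    """
    removed = []
    for prefix in nlri_in_message:
        if adj_rib_in.pop(prefix, None) is not None:
            removed.append(prefix)
    return removed


adj_rib_in = {
    "192.0.2.0/24": {"AS_PATH": "64500"},
    "198.51.100.0/24": {"AS_PATH": "64501"},
    "203.0.113.0/24": {"AS_PATH": "64502"},
}

# The invalid UPDATE carried one known prefix and one never seen before.
removed = handle_invalid_update(adj_rib_in, ["192.0.2.0/24", "10.0.0.0/8"])
print(removed)
print(sorted(adj_rib_in))
```

Stating the scope this explicitly in the draft would remove the ambiguity in the quoted sentence.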

3)  Section 2:
---snip---
contained within the message.  Since in this case, the message
received from the remote peer is syntactically valid, it is
considered that such an UPDATE is indicative of erroneous data within
a path attribute.  [...]
---snip---
s/path attribute/path attributes/

4)  In Section 4, do you really mean "transitive" here, or do you instead 
mean "transient"?
---snip---
It is of particular note for both means of recovering RIB consistency
described that these are effective only when considering transitive
errors within an implementation - for instance, should an RFC ...
---snip---
... also, in the following, do you mean "dynamic" instead of "transitive"?
---snip---
[...] It is not advisable that a transitive
filter and advertisement mechanism is triggered by all error handling
events due to the load this is likely to place on the neighbour
receiving such a request.  Where this BGP speaker is a relatively
---snip---

5)  In Section 7.1, I think you probably want to s/NLRI/route/ in several 
places, e.g.:
---snip---
... upstream speaker to perform a best-path selection, and re-advertise a
new set of __NLRI__ before the downstream system is able to converge to a
new path.
---snip---

Thanks,

-shane




On Jun 19, 2012, at 9:05 AM, George, Wes wrote:
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of
>> Christopher Morrow
>> Sent: Monday, June 11, 2012 4:22 PM
>> To: [email protected]; [email protected] [email protected]
>> Subject: [GROW] WGLC: draft-ietf-grow-ops-reqs-for-bgp-error-handling-04
>> 
>> Hello GROW-WG folk,
>> Please take this message as the start of a 2 week, ending 6/25/2012
>> (June 25, 2012) WGLC for the subject draft, link to current version:
>>  <http://www.ietf.org/internet-drafts/draft-ietf-grow-ops-reqs-for-bgp-
>> error-handling-04.txt>
>> 
>> Abstract:
>> "BGP-4 is utilised as a key intra- and inter-Autonomous System routing
>>   protocol in modern IP networks.  The failure modes as defined by the
>>   original protocol standards are based on a number of assumptions
>>   around the impact of session failure.  Numerous incidents both in the
>>   global Internet routing table and within Service Provider networks
>>   have been caused by strict handling of a single invalid UPDATE
>>   message causing large-scale failures in one or more Autonomous
>>   Systems.
>> 
>>   This memo describes the current use of BGP-4 within Service Provider
>>   networks, and outlines a set of requirements for further work to
>>   enhance the mechanisms available to a BGP-4 implementation when
>>   erroneous data is detected.  Whilst this document does not provide
>>   specification of any standard, it is intended as an overview of a set
>>   of enhancements to BGP-4 to improve the protocol's robustness to suit
>>   its current deployment."
>> 
>> -Chris
>> co-chair
>> _______________________________________________
>> GROW mailing list
>> [email protected]
>> https://www.ietf.org/mailman/listinfo/grow