Dear authors:

Thank you for documenting this important feature.

After reading the document, I think that it still needs work:

(1) The terminology used is not aligned with rfc4271.  Of major importance is
    the description of the Routing Table, the Decision Process, the use of best
    routes (not paths!), etc.  I pointed out multiple occurrences below, but I
    need you to check the whole document for consistency.

(2) The choice to describe the mechanism using a complex scenario (VPNs, etc.)
    has resulted in the text being more complex than it needs to be.

    It is probably too late for this, but a better approach (especially because
    this is an Informational document) would have been to describe the
    functionality using a simple IP-only network.  A more elaborate example
    could have then been included in an Appendix.

(3) The text (mainly in §5) related to hardware limitations is interesting but
    also highly speculative.  Given that it is not part of the main mechanism,
    I would suggest putting this information in an appendix.


I put detailed comments below.


To the Chairs: I couldn't find a request to the idr WG for review.
Did I miss it?  Given the subject, we need to give idr the opportunity
to comment.


Thanks!

Alvaro.


[Line numbers from idnits.]

...
12      Abstract

14      In a network comprising thousands of BGP peers exchanging millions of
15      routes, many routes are reachable via more than one next-hop. Given
16      the large scaling targets, it is desirable to restore traffic after
17      failure in a time period that does not depend on the number of BGP
18      prefixes. In this document we proposed an architecture by which
19      traffic can be re-routed to ECMP or pre-calculated backup paths in a
20      timeframe that does not depend on the number of BGP prefixes. The
21      objective is achieved through organizing the forwarding data
22      structures in a hierarchical manner and sharing forwarding elements
23      among the maximum possible number of routes. The proposed technique
24      achieves prefix independent convergence while ensuring incremental
25      deployment, complete automation, and zero management and provisioning
26      effort. It is noteworthy to mention that the benefits of BGP Prefix
27      Independent Convergence (BGP-PIC) are hinged on the existence of more
28      than one path whether as ECMP or primary-backup.

[style nit] "we proposed"

Please don't write using the first person ("we propose"), use the
third person instead ("this document proposes").


[minor] s/we proposed an architecture/describes an architecture

Other places also say that the document "proposes" -- at this point in
the process we're past proposals. :-)


[minor] Please expand ECMP in the Abstract, and later on first use.


[] "The proposed technique achieves prefix independent convergence
while ensuring incremental deployment, complete automation, and zero
management and provisioning effort."

Great marketing!  This is a technical document, please avoid selling
the solution.


[minor] Consider breaking the Abstract into two paragraphs.



...
109     1. Introduction

111        BGP speakers exchange reachability information about
112        prefixes[1][2] and, for labeled address families, namely AFI/SAFI
113        1/4, 2/4, 1/128, and 2/128, an edge router assigns local labels to
114        prefixes and associates the local label with each advertised
115        prefix using technologies such as L3VPN [9], 6PE [10], and
116        Softwire [8] using BGP label unicast (BGP-LU) technique[3]. A BGP
117        speaker then applies the path selection steps to choose the best
118        path. In modern networks, it is not uncommon to have a prefix
119        reachable via multiple edge routers. In addition to proprietary
120        techniques, multiple techniques have been proposed to allow for
121        BGP to advertise more than one path for a given prefix
122        [7][12][13], whether in the form of equal cost multipath or
123        primary-backup. Another common and widely deployed scenario is
124        L3VPN with multi-homed VPN sites with unique Route Distinguisher.
125        It is advantageous to utilize the commonality among paths used by
126        NLRIs[1] to significantly improve convergence in case of topology
127        modifications.

[] I still haven't read the rest of the document...  PIC is a valuable
and simple technique.  Just from the first sentence (and peeking ahead
at some of the figures and examples), it seems to me that the
description could have been significantly simpler. :-(


[major] rfc7322 requires that RFCs be referenced as [RFCXXXX], please
update the citations.


[minor nit] Multiple citations don't have spaces around the brackets ("[]").


[minor] The reference to rfc4271 is the general reference for BGP, no
need to include a reference to rfc4760 with it.


[] "for labeled address families...L3VPN with multi-homed VPN sites
with unique Route Distinguisher"

The detail about labeled AFs/L3VPN seems unnecessary given that PIC
doesn't just apply to them.


[major] "BGP label unicast (BGP-LU) technique[3]"

Note that rfc8277 doesn't use the terms BGP-LU, "label unicast", or
"labeled unicast".  I know that BGP-LU is commonly used, but it is not
defined in any IETF document (that I could find).  Please define it in
the terminology section.  Also, don't refer to rfc8277 every time you
use the term.


[major] "A BGP speaker then applies the path selection steps to choose
the best path."

rfc4271 doesn't talk about "path selection" or finding a "best path".
It specifies a Decision Process that results in best routes.  Please
use consistent terminology.


[minor] "In addition to proprietary techniques..."

These proprietary techniques are not described anywhere.  Please don't
mention them -- it will only raise unnecessary questions.


[nit] "techniques have been proposed...[7][12][13]"

Add-path and Diverse paths are not just "proposed".


[minor] "It is advantageous to utilize the commonality among paths
used by NLRIs[1] to significantly improve convergence in case of
topology modifications."

You may have to be more explicit about which commonality -- mention it
explicitly.



129        This document proposes a hierarchical and shared forwarding chain
130        organization that allows traffic to be restored to pre-calculated
131        alternative equal cost primary path or backup path in a time
132        period that does not depend on the number of BGP prefixes. The
133        technique relies on internal router behavior that is completely
134        transparent to the operator and can be incrementally deployed and
135        enabled with zero operator intervention. In other words, once it
136        is implemented and deployed on a router, nothing is required from
137        the operator to make it work. It is noteworthy to mention that
138        this document describes FIB architecture that can be implemented
139        in both hardware and/or software.

[nit] s/to pre-calculated/to a pre-calculated


[nit] s/describes FIB/describes a FIB


[minor] Expand FIB on first mention.



141     1.1. Terminology

[major] Please don't deviate (at least) from terminology used in
standards track specifications.  Please don't redefine terms to mean
something different in this document.

If using terminology defined elsewhere, please just reference those
documents, and don't redefine the terms here.



143        This section defines the terms used in this document. For ease of
144        use, we will use terms similar to those used by L3VPN [9].

[major] Similar?  If you're using the terminology from rfc4364, then
please just use it.  Redefining terms doesn't make anything easier.

To be clear: remove any terms that are already defined in rfc4364 and
make the reference Normative.



146        o  BGP prefix: A prefix P/m (of any AFI/SAFI) that a BGP speaker
147           has a path for.

[major] rfc4271 uses the concept of an "IP prefix" carried in a
"route" and described by a "path" (from §1.1):

   Route
      A unit of information that pairs a set of destinations with the
      attributes of a path to those destinations.  The set of
      destinations are systems whose IP addresses are contained in one
      IP address prefix carried in the Network Layer Reachability
      Information (NLRI) field of an UPDATE message.  The path is the
      information reported in the path attributes field of the same
      UPDATE message.



149        o  IGP prefix: A prefix P/m (of any AFI/SAFI) that is learnt via
150           an Interior Gateway Protocol (IGP), such as OSPF and ISIS. The
151           prefix may be learnt directly through the IGP or redistributed
152           from other protocol(s).

[major] rfc4271 uses the term "IGP route".


[minor] The "P/m" notation is not explained -- and is not used outside
of this section.


[major] "is learnt via an...IGP...may be learnt directly through the
IGP or redistributed from other protocol(s)"

There's a conflict in the definition: does it come directly from the
IGP or from something else?  What might those "other protocol(s)" be?



154        o  CE[7]: An external router through which an egress PE can reach
155           a prefix P/m.

[minor] Please expand CE.


[major] The reference to draft-ietf-idr-best-external seems incorrect;
that draft doesn't use CE at all.  This seems to be a term that comes
from RFC4364.


[minor] Please expand PE on first use.



157        o  Egress PE[7], "ePE": A BGP speaker that learns about a prefix
158           through an eBGP peer and chooses that eBGP peer as the next-hop
159           for that prefix.

[major] The reference to draft-ietf-idr-best-external seems incorrect;
that draft doesn't use PE at all.  This seems to be a term that comes
from RFC4364.


[minor] s/eBGP/EBGP/g
That is how rfc4271 uses it.


[minor] Expand EBGP on first use.



161        o  Ingress PE, "iPE": A BGP speaker that learns about a prefix
162           through a iBGP peer and chooses an egress PE as the next-hop for
163           the prefix.

[minor] s/iBGP/IBGP/g
That is how rfc4271 uses it.


[minor] Expand IBGP on first use.



165        o  Path: The next-hop in a sequence of nodes starting from the
166           current node and ending with the destination node or network
167           identified by the prefix. The nodes may not be directly
168           connected.

[major] See the definition above from rfc4271.


[major] Some of the definitions refer to BGP, but others to the FIB
characteristics.  Please differentiate: "FIB Path" vs "BGP Path", for
example.  Note that rfc4271 uses the term "Routing Table" as follows
(from §3.2: Routing Information Base):

   Routing information that the BGP speaker uses to forward packets (or
   to construct the forwarding table used for packet forwarding) is
   maintained in the Routing Table.  The Routing Table accumulates
   routes to directly connected networks, static routes, routes learned
   from the IGP protocols, and routes learned from BGP.  Whether a
   specific BGP route should be installed in the Routing Table, and
   whether a BGP route should override a route to the same destination
   installed by another source, is a local policy decision, and is not
   specified in this document.  In addition to actual packet forwarding,
   the Routing Table is used for resolution of the next-hop addresses
   specified in BGP updates (see Section 5.1.3).



170        o  Recursive path: A path consisting only of the IP address of the
171           next-hop without the outgoing interface. Subsequent lookups are
172           necessary to determine the outgoing interface and a directly
173           connected next-hop.

[major] rfc4271 already used "recursive route".



175        o  Non-recursive path: A path consisting of the IP address of a
176           directly connected next-hop and outgoing interface.

[] See above.



...
181        o  Primary path: A recursive or non-recursive path that can be
182           used all the time as long as a walk starting from this path can
183           end to an adjacency. A prefix can have more than one primary
184           path.

[minor] "can be used all the time"

I'm not sure what you mean here.


[minor] The concept of a "walk" hasn't been introduced -- please put a
pointer to where that is discussed.



...
189        o  Leaf: A container data structure for a prefix or local label.
190           Alternatively, it is the data structure that contains prefix
191           specific information.

[] So a leaf can be many things: "prefix or local label...[or] prefix
specific information".  Hmmm....



...
198        o  Pathlist: An array of paths used by one or more prefixes to
199           forward traffic to destination(s) covered by an IP prefix. Each
200           path in the pathlist carries its "path-index" that identifies its
201           position in the array of paths. In general, the value of the
202           "path-index" stored in path may not necessarily have the same
203           value of the location of the path in the pathlist. For example
204           the 3rd path may carry path-index value of 1. A pathlist may
205           contain a mix of primary and backup paths.

[nit] s/its "path-index"/a "path-index"



207        o  OutLabel-List: Each labeled prefix is associated with an
208           OutLabel-List. The OutLabel-List is an array of one or more
209           outgoing labels and/or label actions where each label or label
210           action has 1-to-1 correspondence to a path in the pathlist.
211           Label actions[6] are: push the label, pop the label, swap the
212           incoming label with the label in the OutLabel-List entry, or
213           don't push anything at all in case of "unlabeled". The prefix
214           may be an IGP or BGP prefix.

[major] If the label actions are defined in rfc3031, please just
point to it -- is the "swap the incoming label with the label in the
OutLabel-List entry" defined there too?

It seems to me that the action defined here is part of a general
action taken by the FIB, and not specific only to labels (as in a
different encapsulation, for example).



216        o  Forwarding chain: It is a compound data structure consisting of
217           multiple connected block that a forwarding engine walks one
218           block at a time to forward the packet out of an interface.
219           Section 2.2 explains an example of a forwarding chain.
220           Subsequent sections provide additional examples

[nit] s/multiple connected block/multiple connected blocks


[nit] s/additional examples/additional examples.


[minor] Note that both "forwarding chain" and "forwarding Chain" are
used throughout the text.  Please be consistent.



...
229        o  Route: A prefix with one or more paths associated with it.  The
230           minimum set of objects needed to construct a route is a leaf
231           and a pathlist.

[major] See the comments above about rfc4271 and the FIB in general.



233     2. Overview

235        The idea of BGP-PIC is based on two pillars

[minor] The descriptions below are very thick (hard to read through)
without more context.  Suggestion: keep the description simple in this
section and point to where there is more detail.



236        o  A shared hierarchical forwarding chain: It is not uncommon to see
237           multiple destinations reachable via the same list of next-hops.
238           Instead of having a separate list of next-hops for each
239           destination, all destinations sharing the same list of next-hops
240           can point to a single copy of this list thereby allowing fast
241           convergence by making changes to a single shared list of next-
242           hops rather than possibly a large number of destinations. Because
243           paths in a pathlist may be recursive, a hierarchy is formed
244           between pathlist and the resolving prefix whereby the pathlist
245           depends on the resolving prefix.

247        o  A forwarding plane that supports multiple levels of indirection:
248           A forwarding chain that starts with a destination and ends with
249           an outgoing interface is not a simple flat structure. Instead a
250           forwarding entry is constructed via multiple levels of
251           dependency. A BGP NLRI uses a recursive next-hop, which in turn
252           resolves via an IGP next-hop, which in turn resolves via an
253           adjacency consisting of one or more outgoing interface(s) and
254           next-hop(s).

[nit] s/Instead a/Instead, a


[minor] Are "multiple levels of indirection" and "multiple levels of
dependency" the same thing?  The latter is used to describe the
former.


[] Next-hop is also a term that needs to be differentiated.  In this
paragraph it is used in the context of a BGP route, an IGP route, and
a local address on the FIB.  See rfc4271.



256        Designing a forwarding plane that constructs multi-level forwarding
257        chains with maximal sharing of forwarding objects allows rerouting a
258        large number of destinations by modifying a small number of objects
259        thereby achieving convergence in a time frame that does not depend
260        on the number of destinations. For example, if the IGP prefix that
261        resolves a recursive next-hop is updated there is no need to update
262        the possibly large number of BGP NLRIs that use this recursive next-
263        hop.

[] The example is orders of magnitude simpler than the actual
explanation.  BTW, none of the words used ("multi-level forwarding
chains with maximal sharing of forwarding objects") are mentioned in
the description of the pillars.
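
To show how little the example needs, here is a minimal Python sketch.
It is only an illustration under my own assumptions: the names
(igp_pathlist, bgp_leaves) and the 10.x prefixes are hypothetical and
are not taken from the draft or from any implementation.

```python
from typing import Dict, List

# Hypothetical shared object: the list of IGP paths that resolves one
# recursive BGP next-hop.
igp_pathlist: List[str] = ["IGP-NH1,I1", "IGP-NH2,I2"]

# Many BGP leaves point at the same object instead of holding a copy.
bgp_leaves: Dict[str, List[str]] = {
    f"10.{i // 256}.{i % 256}.0/24": igp_pathlist for i in range(1000)
}

# The IGP route changes (for example, interface I2 goes down).
# Exactly one object is modified, regardless of the number of leaves.
igp_pathlist.remove("IGP-NH2,I2")

# Every leaf sees the change without having been touched itself.
updated = sum(1 for paths in bgp_leaves.values() if paths == ["IGP-NH1,I1"])
```

The cost of the update is one list operation whether there are ten or
a million leaves, which is the point the paragraph is trying to make.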



265     2.1. Dependency

267        This section describes the required functionalities in the
268        forwarding and control planes to support BGP-PIC described in this
269        document.

[nit] s/BGP-PIC described/BGP-PIC as described



271     2.1.1. Hierarchical Hardware FIB (Forwarding Information Base)

[major] The Introduction says that "this document describes FIB
architecture that can be implemented in both hardware and/or
software", but this section requires a hardware implementation.  ??



273        BGP-PIC requires a hierarchical hardware FIB support: for each BGP
274        forwarded packet, a BGP leaf is looked up, then a BGP Pathlist is
275        consulted, then an IGP Pathlist, then an Adjacency.

[major] "BGP forwarded packet"

I think you mean something along the lines of a packet destined to an
external destination, or to a destination learned through BGP, or ...
??


[major] The definitions of both a leaf and a pathlist don't assume
protocol origin.  IOW, the reader has to make significant assumptions
to understand that you mean "a leaf/pathlist that resulted from
BGP/IGP-learned information" (or something like that).  If the
identification of the origin protocol is required, please also say so.


[major] "Pathlist" and "Adjacency" are capitalized, but "leaf" isn't.
Personally, I like capitalizing terms with specific meaning.  Not all
instances (throughout the document) are treated the same.  Please be
consistent throughout.


[major] The general process above: lookup the destination (BGP leaf),
which points to possible alternatives (BGP pathlist), which
recursively resolve...  assumes knowledge of how a FIB is generally
built and the operations that forwarding a packet requires.  This may
not be common knowledge to every reader.  Please either explain the
assumptions or reference a place where the explanation exists already.
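
As an illustration of the kind of explanation I have in mind, here is
a minimal Python sketch of the lookup sequence.  It is an assumption
of how a hierarchical FIB could be modeled, not a description of any
implementation: the names (Adjacency, Pathlist, resolve) are
hypothetical, leaves are found by exact match instead of longest
prefix match, and labels are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Adjacency:
    """Directly connected next-hop plus outgoing interface (end of walk)."""
    next_hop: str
    interface: str


@dataclass
class Pathlist:
    """Array of paths: an Adjacency, or the key of the resolving leaf."""
    paths: List[Union[Adjacency, str]] = field(default_factory=list)


# Hypothetical FIB: leaf key -> Pathlist used by that leaf.
Fib = Dict[str, Pathlist]


def resolve(fib: Fib, leaf: str, max_depth: int = 8) -> List[Adjacency]:
    """Walk leaf -> Pathlist -> (resolving leaf -> Pathlist)* -> Adjacency."""
    if max_depth == 0:
        raise RecursionError("resolution too deep (possible loop)")
    result: List[Adjacency] = []
    for path in fib[leaf].paths:
        if isinstance(path, Adjacency):
            result.append(path)  # non-recursive path: the walk ends here
        else:
            # Recursive path: continue through the resolving leaf.
            result.extend(resolve(fib, path, max_depth - 1))
    return result


igp_pathlist = Pathlist([Adjacency("IGP-NH1", "I1"), Adjacency("IGP-NH2", "I2")])
fib: Fib = {
    "192.0.2.1/32": igp_pathlist,                   # learned from the IGP
    "198.51.100.0/24": Pathlist(["192.0.2.1/32"]),  # learned from BGP
}
adjacencies = resolve(fib, "198.51.100.0/24")
```

Here the destination learned from BGP resolves through the leaf
learned from the IGP and ends at the two Adjacencies on I1 and I2.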



277        An alternative method consists in "flattening" the dependencies when
278        programming the BGP destinations into HW FIB resulting in
279        potentially eliminating both the BGP Path-List and IGP Path-List
280        consultation. Such an approach decreases the number of memory
281        lookups per forwarding operation at the expense of HW FIB memory
282        increase (flattening means less sharing thereby less duplication),
283        loss of ECMP properties (flattening means less pathlist entropy) and
284        loss of BGP-PIC properties.

[minor] HW is mentioned here again...and the concept of programming,
which goes back to the general knowledge assumptions that you're
making.


[minor] s/Path-List/Pathlist/g


[major] Am I to assume that this "alternative method" shouldn't be
considered/used, especially if it results in loss of the "BGP-PIC
properties"??  This section started by mentioning a requirement -- if
this other approach is not what is required then why even mention it?



286     2.1.2. Availability of more than one BGP next-hops

288        When the primary BGP next-hop fails, BGP-PIC depends on the
289        availability of one or more pre-computed and pre-installed secondary
290        BGP next-hop(s) in the BGP Pathlist.

[major] "primary BGP next-hop"

At this point in the document you've hinted at the possibility of
having multiple BGP routes (with potentially different NEXT_HOPs), but
the concept of "primary BGP next-hop" hasn't been explained.  I'm
assuming that by this you mean the BGP best route [rfc4271].  (Again,
you're making assumptions of the knowledge of the reader.)

I can see that this section talks about "secondary next-hops" later.
Perhaps change the order of the text so that the conditions precede
the explanation of the actions.


[major] "BGP next-hop fails"

rfc4271 talks about the address in the NEXT_HOP attribute becoming
not resolvable.  Please use established terminology!


[major] "pre-computed and pre-installed...in the BGP Pathlist"

There's an example of "pre-computed" later on, but I couldn't find an
explanation of "pre-installed...in the BGP Pathlist".



292        The existence of a secondary next-hop is clearly required for the
293        following reason: a service caring for network availability will
294        require two disjoint network connections resulting in two BGP next-
295        hops.

[minor] You mean "clearly required" for BGP-PIC, right?  Not for the service.



297        The BGP distribution of secondary next-hops is available thanks to
298        the following BGP mechanisms: Add-Path [12], BGP Best-External [7],
299        diverse path [13], and the frequent use in VPN deployments of
300        different VPN RD's per PE. Another option to learn multiple BGP
301        NH/path is simply to receive IBGP paths from multiple BGP RR
302        selection a different path as best. It is noteworthy to mention that
303        the availability of another BGP path does not mean that all failure
304        scenarios can be covered by simply forwarding traffic to the
305        available secondary path. The discussion of how to cover various
306        failure scenarios is beyond the scope of this document.

[minor] "BGP NH/path" is a new term.  Even if using "NH" seems
obvious, please mention it at least once.  Also, you've been talking
(incorrectly) about BGP paths (instead of routes) or BGP next-hops and
now you're combining the two.  This is the only instance.


[nit] s/is simply to receive/is to receive


[nit] s/multiple BGP RR selection/multiple BGP RRs selecting


[minor] You might want to reference rfc9107 as an example of "multiple
BGP RRs selecting a different path as best".



308     2.2. BGP-PIC Illustration

310        To illustrate the two pillars above as well as the platform
311        dependency, we will use an example of a simple multihomed L3VPN [9]
312        prefix in a BGP-free core running LDP [4] or segment routing over
313        MPLS forwarding plane [5].

[nit] s/a simple multihomed/a multihomed

What is "simple" to you may not be simple to others.  Just state the facts.


[minor] s/L3VPN [9]/L3VPN
The citation is not needed all the time.


[minor] "core running LDP [4] or segment routing over MPLS forwarding plane [5]"

This is an example.  To simplify, pick one.



315         +--------------------------------+
316         |                                |
317         |                               ePE2 (IGP-IP1 192.0.2.1, Loopback)
318         |                                |  \
319         |                                |   \
320         |                                |    \
321        iPE                               |    CE....VRF "Blue", ASnum 65000
322         |                                |    /   (VPN-IP1 198.51.100.0/24)
323         |                                |   /    (VPN-IP2 203.0.113.0/24)
324         |   LDP/Segment-Routing Core     |  /
325         |                               ePE1 (IGP-IP2 192.0.2.2, Loopback)
326         |                                |
327         +--------------------------------+
328                  Figure 1 VPN prefix reachable via multiple PEs

[nit] s/ASnum/ASN

Expand on first use.



330        Referring to Figure 1, suppose the iPE (the ingress PE) receives
331        NLRIs for the VPN prefixes VPN-IP1 and VPN-IP2 from two egress PEs,
332        ePE1 and ePE2 with next-hop BGP-NH1 and BGP-NH2, respectively.
333        Assume that ePE1 advertise the VPN labels VPN-L11 and VPN-L12 while
334        ePE2 advertise the VPN labels VPN-L21 and VPN-L22 for VPN-IP1 and
335        VPN-IP2, respectively. Suppose that BGP-NH1 and BGP-NH2 are resolved
336        via the IGP prefixes IGP-IP1 and IGP-IP2, where each happen to have
337        2 equal cost paths with IGP-NH1 and IGP-NH2 reachable via the
338        interfaces I1 and I2, respectively. Suppose that local labels
339        (whether LDP [4] or segment routing [5]) on the downstream LSRs[4]
340        for IGP-IP1 are IGP-L11 and IGP-L12 while for IGP-IP2 are IGP-L21
341        and IGP-L22. As such, the routing table at iPE is as follows:

[minor] "receives NLRIs"

In other places you've talked about paths or routes...please be consistent!


[minor] "IGP-NH1 and IGP-NH2 reachable via the interfaces I1 and I2"

I'm sure you're talking about interfaces on the iPE, but others may
find that the picture is lacking detail when compared to the
description: the core is just a box.



343              65000: 198.51.100.0/24
344                 via ePE1 (192.0.2.1), VPN Label: VPN-L11
345                 via ePE2 (192.0.2.2), VPN Label: VPN-L21

[minor] You need to explain the notation and, for example, why some
entries have an ASN and others don't.


[minor] It might help if the description included the IP addresses.
For example, when mentioning BGP-NH1, indicate that it refers to
192.0.2.1.



...
362        IP Leaf:    Pathlist:    IP Leaf:           Pathlist:
363        --------  +-------+     --------          +----------+
364        VPN-IP1-->|BGP-NH1|-->IGP-IP1(BGP NH1)--->|IGP-NH1,I1|--->Adjacency1
365          |       |BGP-NH2|-->....      |         |IGP-NH2,I2|--->Adjacency2
366          |       +-------+             |         +----------+
367          |                             |
368          |                             |
369          v                             v
370        OutLabel-List:                OutLabel-List:
371        +----------------------+      +----------------------+
372        |VPN-L11 (VPN-IP1, NH1)|      |IGP-L11 (IGP-IP1, NH1)|
373        |VPN-L21 (VPN-IP1, NH2)|      |IGP-L12 (IGP-IP1, NH2)|
374        +----------------------+      +----------------------+

376                Figure 2 Shared Hierarchical Forwarding Chain at iPE

378        The forwarding chain depicted in Figure 2 illustrates the first
379        pillar, which is sharing and hierarchy. We can see that the BGP
380        pathlist consisting of BGP-NH1 and BGP-NH2 is shared by all NLRIs
381        reachable via ePE1 and ePE2. As such, it is possible to make changes
382        to the pathlist without having to make changes to the NLRIs. For
383        example, if BGP-NH2 becomes unreachable, there is no need to modify
384        any of the possibly large number of NLRIs. Instead only the shared
385        pathlist needs to be modified. Likewise, due to the hierarchical
386        structure of the forwarding chain, it is possible to make
387        modifications to the IGP routes without having to make any changes
388        to the BGP NLRIs. For example, if the interface "I2" goes down, only
389        the shared IGP pathlist needs to be updated, but none of the IGP
390        prefixes sharing the IGP pathlist nor the BGP NLRIs using the IGP
391        prefixes for resolution need to be modified.

[minor] "We can see that the BGP pathlist consisting of BGP-NH1 and
BGP-NH2 is shared by all NLRIs reachable via ePE1 and ePE2."

I only see VPN-IP1 using that pathlist.

I know it is not easy to draw using ASCII, but you may want to provide
a conceptual figure showing how the leaves and Pathlists relate to each
other and then describe the contents.  Alternatively, you can integrate an
SVG drawing in the XML.


[minor] The representation of the Pathlists is not consistent: the
first one shows just BGP-NHs, while the second one has an extra
element.  It may be obvious to you that the interface element points
at the corresponding Adjacencies, but please explain that too.


[major] The description is not complete -- it jumps from using one
Pathlist to the other, but it doesn't cover the recursive nature of
the resolution: resolving the BGP-NHs in the Pathlist through another
leaf.


[major] The outlabel-list is not explained anywhere.



393        Figure 2 can also be used to illustrate the second BGP-PIC pillar.
394        Having a deep forwarding chain such as the one illustrated in Figure
395        2 requires a forwarding plane that is capable of accessing multiple
396        levels of indirection in order to calculate the outgoing
397        interface(s) and next-hops(s). While a deeper forwarding chain
398        minimizes the re-convergence time on topology change, there will
399        always exist platforms with limited capabilities and hence imposing
400        a limit on the depth of the forwarding chain. Section 5 describes
401        how to gracefully trade off convergence speed with the number of
402        hierarchical levels to support platforms with different
403        capabilities.

[minor] "Figure 2 can also be used to illustrate the second BGP-PIC pillar."

Ok.  But the rest of the paragraph doesn't explain how that is needed
in the example.



405        Another example using IPv6 addresses can be something like the
406        following

408              65000: 2001:DB8:1::/48
409                 via ePE1 (192::1), VPN Label: VPN6-L11
410                 via ePE2 (192::2), VPN Label: VPN6-L21

412              65000: 2001:DB8:2:/48
413                 via ePE1 (192::1), VPN Label: VPN6-L12
414                 via ePE2 (192::2), VPN Label: VPN6-L22

416              192::1/128
417                 via Core, Label:  IGP6-L11
418                 via Core, Label:  IGP6-L12

420              192::2/128
421                 via Core, Label:  IGP6-L21
422                 via Core, Label:  IGP6-L22

424        The same hierarchical forwarding chain described can be constructed
425        for IPv6 addresses/prefixes.

[major] Please use the addresses from rfc3849.


[] BTW, sorry to be blunt, but this is a lazy attempt at using IPv6 in
an example.  You should at least show a figure and maybe a different
scenario...not just similar numbers.



427     3. Constructing the Shared Hierarchical Forwarding Chain

429        Constructing the forwarding chain is an application of the two
430        pillars described in Section 2. This section describes how to
431        construct the forwarding chain in hierarchical shared manner.

[nit] s/in hierarchical/in a hierarchical



433     3.1. Constructing the BGP-PIC forwarding Chain

435        The whole process starts when BGP downloads a prefix to FIB. The
436        prefix contains one or more outgoing paths. For certain labeled
437        prefixes, such as VPN [9] prefixes, each path may be associated with
438        an outgoing label and the prefix itself may be assigned a local
439        label. The list of outgoing paths defines a pathlist. If such
440        pathlist does not already exist, then FIB manager (software or
441        hardware entity responsible for managing the FIB) creates a new
442        pathlist, otherwise the existing pathlist is used. The BGP prefix is
443        added as a dependent of the pathlist.

[major] "BGP downloads a prefix to FIB"

rfc4271 talks about "routes...installed in the Routing Table".
Please, as much as possible, use the terminology from established
RFCs.  Note that this document doesn't define what a FIB is.


[major] "The prefix contains one or more outgoing paths."

This may be another terminology issue: a BGP route doesn't contain
multiple next-hops.  If the routes came from ADD_PATH or ORR, etc.,
there will be multiple BGP routes with a common NLRI and different
next-hops.  IOW, it looks like the Pathlist would be built
incrementally as routes for the same NLRI are installed.

The end result should be the same.  Terminology matters!


[minor] "VPN [9]"

The other references to rfc4364 use "L3VPN", should this one use it too?


[major] "If such pathlist does not already exist..."

It should be mentioned somewhere the reason for this Pathlist to
already exist: there's probably another BGP route using it.  This is
where the connection to the shared nature of the chain needs to be
made.  Otherwise, the description sounds as if there might simply be a
Pathlist structure available...


[nit] s/FIB manager/the FIB manager/g


[] It might be easier/better/clearer to explain the process as a
series of steps.  Note that the description above is really a series
of steps: BGP has a route to install, the route is accepted in the
routing table (local policy), a leaf for this NLRI exists (or not),
the Pathlist exists (or not), etc...
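
[] To illustrate what I mean by a series of steps, here is a minimal
sketch.  It is not taken from the draft: the class and method names
(Fib, Pathlist, install) are hypothetical, and a Pathlist is keyed
simply by its ordered list of next-hops, which is my assumption.  The
point is that the "Pathlist already exists" case happens because
another route with the same set of paths was installed earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Pathlist:
    paths: tuple                                   # ordered next-hops (assumed key)
    dependents: set = field(default_factory=set)   # prefixes sharing this pathlist


class Fib:
    """Hypothetical FIB manager; illustrative only."""

    def __init__(self):
        self.pathlists = {}   # paths tuple -> Pathlist
        self.leaves = {}      # prefix -> Pathlist

    def install(self, prefix, next_hops):
        key = tuple(next_hops)

        # Step 1: if a leaf exists and its paths changed, detach it from
        # the old pathlist (and drop that pathlist if nobody uses it).
        old = self.leaves.get(prefix)
        if old is not None and old.paths != key:
            old.dependents.discard(prefix)
            if not old.dependents:
                del self.pathlists[old.paths]

        # Step 2: reuse the pathlist if another route already created it,
        # otherwise create a new one.
        pathlist = self.pathlists.get(key)
        if pathlist is None:
            pathlist = Pathlist(key)
            self.pathlists[key] = pathlist

        # Step 3: record the prefix as a dependent of the pathlist.
        pathlist.dependents.add(prefix)
        self.leaves[prefix] = pathlist
        return pathlist
```

Installing two prefixes with the same next-hops returns the same
Pathlist object, which is where the text should make the connection
to the shared nature of the chain.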



445        The previous step constructs the upper part of the hierarchical
446        forwarding chain. The forwarding chain is completed by resolving the
447        paths of the pathlist. A BGP path usually consists of a next-hop.
448        The next-hop is resolved by finding a matching IGP prefix.

[major] "The forwarding chain is completed by resolving the paths of
the pathlist."

The example in the last section shows that this "lower part" (is that
the right term?) could also include a Pathlist.  The explanation of
how the "upper part" is constructed didn't mention that the process
was to "resolve" the BGP NLRI.  As a result, there's a disconnect
between what was done before and what is assumed here.

Again, a series of steps would serve the purpose better.

Also, be mindful of the use of "resolve" in rfc4271.


[major] "A BGP path usually consists of a next-hop."

As opposed to what?  I mean the NEXT_HOP attribute is mandatory, a
route includes multiple other Path Attributes...  Not sure what the
point is.


[minor] "The next-hop is resolved by finding a matching IGP prefix."

Yes, as an extreme simplification.  §9.1.2.1/rfc4271 offers a better
description.



450        The end result is a hierarchical shared forwarding chain where the
451        BGP pathlist is shared by all BGP prefixes that use the same list of
452        paths and the IGP prefix is shared by all pathlists that have a path
453        resolving via that IGP prefix. It is noteworthy to mention that the
454        forwarding chain is constructed without any operator intervention at
455        all.

[minor] "The end result is a hierarchical shared forwarding chain
where the BGP pathlist is shared..."

As I mentioned above, more emphasis should be placed (when
explaining the steps) on the shared nature of the Pathlists and how
their dependents are independent of each other too.


[minor] "without any operator intervention"

Yes, this was mentioned as an advantage somewhere.  It would be better
if the document was explicit from the start on the point that it is
describing a set of abstract structures and procedures to be
internally instantiated.  Just like protocol structures (neighbors,
routes, etc.) that are transparent to the user.



...
460     3.2. Example: Primary-Backup Path Scenario

462        Consider the egress PE ePE1 in the case of the multi-homed VPN
463        prefixes in the BGP-free core depicted in Figure 1. Suppose ePE1
464        determines that the primary path is the external path but the backup
465        path is the iBGP path to the other PE ePE2 with next-hop BGP-NH2.
466        ePE2 constructs the forwarding chain depicted in Figure 3. We are
467        only showing a single VPN prefix for simplicity. But all prefixes
468        that are multihomed to ePE1 and ePE2 share the BGP pathlist.

[minor] "BGP-free core depicted in Figure 1"

It doesn't really make a difference, but it wasn't mentioned before
that Figure 1 has a BGP-free core.  I'm mentioning this because there
are inconsistencies in the description.  Also, this is a detail that
is not important to the description, but that can create confusion for
other readers and generate unnecessary questions and comments.


[minor] "Suppose ePE1 determines..."

It would be great if there was either a pointer to how ePE1 does all
this (where are backups specified?), or a little bit more explanation
of the assumptions.


[nit] s/external path but the backup/external path, while the backup



470                         BGP OutLabel-List
471          VPN-L11            +---------+
472        (Label-leaf)---+---->|Unlabeled|
473                       |     +---------+
474                       v     | VPN-L21 |
475                       |     | (swap)  |
476                       |     +---------+
477                       |
478                       |                    BGP Pathlist
479                       |                   +------------+    Connected route
480                       |                   |   CE-NH    |------>(to the CE)
481                       |                   |path-index=0|
482                       |                   +------------+
483                       |                   |  VPN-NH2   |
484          VPN-IP1 -----+------------------>|  (backup)  |------>IGP Leaf
485        (IP leaf)                          |path-index=1|    (Towards ePE2)
486             |                             +------------+
487             |
488             |           BGP OutLabel-List
489             |              +---------+
490             +------------->|Unlabeled|
491                            +---------+
492                            | VPN-L21 |
493                            | (push)  |
494                            +---------+

496        Figure 3 : VPN Prefix Forwarding Chain with eiBGP paths on egress PE

[minor] Figure 2 provided an easy-to-follow dependency order: Leaf to
Pathlist  to Leaf to Pathlist.  The dependencies in Figure 3 are not
straightforward, please provide a description of the figure before
explaining the differences with respect to the other example.


[major] The definition of Pathlist (§1.1) implies that all its entries
can be used for forwarding, but the Pathlist above (and this use case)
clearly indicate that not all the entries in a Pathlist are used in
the same way, or at the same time.



498        The example depicted in Figure 3 differs from the example in Figure
499        2 in two main aspects. First, as long as the primary path towards
500        the CE (external path) is useable, it will be the only path used for
501        forwarding while the OutLabel-List contains both the unlabeled label
502        (primary path) and the VPN label (backup path) advertised by the
503        backup path ePE2. The second aspect is presence of the label leaf
504        corresponding to the VPN prefix. This label leaf is used to match
505        VPN traffic arriving from the core. Note that the label leaf shares
506        the pathlist with the IP prefix.

[major] "the...path...is useable"

What does "useable" mean?  Please describe this term in the terminology section.


[major] "primary path towards the CE (external path) is useable, it
will be the only path used"

The arrow from VPN-IP1 points to the VPN-NH2 entry in the Pathlist, so
the description doesn't seem to match.


[major] "the OutLabel-List"

There are two boxes named "BGP OutLabel-List", and they contain
different backup actions (swap vs push).


[minor] "unlabeled label"

rfc3031 talks about unlabeled packets, not labels.


[minor] A third difference is that this figure includes a "path-index"
-- but that is not explained.



508     4. Forwarding Behavior
...
513        When a packet arrives at a router, it matches a leaf. A labeled
514        packet matches a label leaf while an IP packet matches an IP leaf.
515        The forwarding engines walks the forwarding chain starting from the
516        leaf until the walk terminates on an adjacency. Thus when a packet
517        arrives, the chain is walked as follows:

[minor] Instead of summarizing, simply start with the steps.



519        1. Lookup the leaf based on the destination address or the label at
520           the top of the packet.

[major] Lookup how?  How do you know you found the right leaf?

I think rfc3031 has some text related to the forwarding of labeled
packets.  But I don't know of a reference for IP forwarding, so you
should at least mention something around a longest-prefix-match.



522        2. Retrieve the parent pathlist of the leaf.

[minor] While it is related, the terminology section defines the
dependency relationship as the leaf being the child of the Pathlist --
keeping consistency is important.



524        3. Pick the outgoing path "Pi" from the list of resolved paths in
525           the pathlist. The method by which the outgoing path is picked is
526           beyond the scope of this document (e.g. flow-preserving hash
527           exploiting entropy within the MPLS stack and IP header). Let the
528           "path-index" of the outgoing path "Pi" be "j".

[minor] s/Pick the outgoing path "Pi"/Pick an outgoing path "Pi"


[] "Let the "path-index" of the outgoing path "Pi" be "j"."

I'm not sure if you mean that "j is the path-index of Pi" or "assign a
path-index of j".  If the former, how is the path-index assigned?



530        4. If the prefix is labeled, use the "path-index" "j" to retrieve
531           the jth label "Lj" stored the jth entry in the OutLabel-List and
532           apply the label action of the label on the packet (e.g. for VPN
533           label on the ingress PE, the label action is "push"). As
534           mentioned in Section 1.1 the value of the "path-index" stored
535           in path may not necessarily be the same value of the location of
536           the path in the pathlist.

[minor] "retrieve the jth label "Lj" stored the jth entry in the OutLabel-List"

This sounds as if there were (at least) j labels in each of (at least)
j entries in the OutLabel-List.   But I assume that there is really
only one label (or label stack) in that entry.   ??



538        5. Move to the parent of the chosen path "Pi".

540        6. If the chosen path "Pi" is recursive, move to its parent prefix
541           and go to step 2.

543        7. If the chosen path is non-recursive move to its parent adjacency.
544           Otherwise go to the next step.

[minor] These 3 steps say the same thing: "move to the parent
of..."Pi"", "path "Pi"...move to its parent", and "move to its
parent".  I think you need to better differentiate (?) the options.



...
549        Let's apply the above forwarding steps to the forwarding chain
550        depicted in Figure 2 in Section 2. Suppose a packet arrives at
551        ingress PE iPE from an external neighbor. Assume the packet matches
552        the VPN prefix VPN-IP1. While walking the forwarding chain, the
553        forwarding engine applies a hashing algorithm to choose the path and
554        the hashing at the BGP level yields path 0 while the hashing at the
555        IGP level yields path 1. In that case, the packet will be sent out
556        of interface I2 with the label stack "IGP-L12,VPN-L11".

[minor] s/path 0...path 1/path-index 0...path-index 1


[minor] The example would be better understood if it was presented as
steps matching the process described above (vs a narrative
explanation).  Also, be specific in which path is the one with each
path-index...


[major] Figure 2 doesn't contain any indication of a path-index.
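
[] As an illustration of presenting the example as steps, here is a
rough sketch of the walk from Section 4 applied to the values in the
narrative above.  The structure names (Leaf, Path, walk), the "pick"
callback standing in for the hash, the interface names I1/I2 for
path-index 0/1, and the assumption that every label action is "push"
are mine, not the draft's.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Path:
    path_index: int
    parent_leaf: Optional["Leaf"] = None   # set when the path is recursive
    adjacency: Optional[str] = None        # set when the path is non-recursive


@dataclass
class Leaf:
    name: str
    pathlist: list                         # list of Path
    out_labels: Optional[list] = None      # OutLabel-List, None if unlabeled


def walk(leaf, pick):
    """Walk the chain from a leaf; return (adjacency, label stack, top first)."""
    labels = []
    while True:
        path = pick(leaf)                               # step 3: choose a path
        if leaf.out_labels is not None:                 # step 4: label by path-index
            labels.insert(0, leaf.out_labels[path.path_index])
        if path.parent_leaf is not None:                # step 6: recursive path
            leaf = path.parent_leaf
            continue
        return path.adjacency, labels                   # step 7: adjacency reached


# Values from the narrative: BGP hash yields path-index 0, IGP hash yields 1.
epe1 = Leaf("ePE1", [Path(0, adjacency="I1"), Path(1, adjacency="I2")],
            ["IGP-L11", "IGP-L12"])
epe2 = Leaf("ePE2", [Path(0, adjacency="I1"), Path(1, adjacency="I2")],
            ["IGP-L21", "IGP-L22"])
vpn_ip1 = Leaf("VPN-IP1", [Path(0, parent_leaf=epe1), Path(1, parent_leaf=epe2)],
               ["VPN-L11", "VPN-L21"])

hash_result = {"VPN-IP1": 0, "ePE1": 1, "ePE2": 1}
result = walk(vpn_ip1, lambda leaf: leaf.pathlist[hash_result[leaf.name]])
# result == ("I2", ["IGP-L12", "VPN-L11"])
```

Under these assumptions the walk ends on interface I2 with the stack
"IGP-L12,VPN-L11", which matches the text.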

558     5. Handling Platforms with Limited Levels of Hierarchy


[] This whole section, which is longer than the text describing the
"normal" mechanism (!), seems to be more appropriate for an Appendix
than the main body of the document.  Please move it.



...
564        o  Being able to reduce the number of hierarchical levels from any
565           arbitrary value to a smaller arbitrary value that can be
566           supported by the forwarding engine.

[?] What is the worst case in terms of number of levels?  3?  Just
curious. I'm asking simply because an "arbitrary value" sounds like it
could be 10s or even 1000s, when in reality it is not more than a
handful.  Are there many platforms out there that wouldn't support 2-3
levels (for the number of routes/paths they normally would carry)?


[minor] "smaller arbitrary value"

This smaller value is not really arbitrary (as in random), it is
determined by the platform.  Maybe something along the lines of a
"value supported by the platform" would be better.



...
571     5.1. Flattening the Forwarding Chain
...
584        1. The forwarding engine is now at leaf "R1".
...
605        9. Now the forwarding engine continues the walk to the parent of
606           "Qj".

[minor] This list of steps is just a restatement of the process in the
last section.  IMO there's no need to repeat the steps here.  Point at
the 2-level example, and jump directly to the next paragraph.

To reference from the explanation below, "a picture is worth a
thousand words". ;-)



...
614        1. FIB manager wants to reduce the number of levels used by "Pi" by
615           1.

[nit] This is not really a step...



...
622        4. FIB manager also extracts the OutLabel-list(R2) associated with
623           the leaf "R2". Remember that OutLabel-list(R2) = <L1, L2,...,
624           Lm>.

[] New syntax: "OutLabel-list(R2)".



...
643        It is noteworthy to mention that the labels in the OutLabel-list
644        associated with the "flattened" pathlist may be stored in the same
645        memory location as the path itself to avoid additional memory
646        access. But that is an implementation detail that is beyond the
647        scope of this document.

[] Why even mention it then?



649        The same steps can be applied to all paths in the pathlist <P1,
650        P2,..., Pn> so that all paths are "flattened" thereby reducing the
651        number of hierarchical levels by one. Note that that "flattening" a
652        pathlist pulls in all paths of the parent paths, a desired feature
653        to utilize all ECMP/UCMP paths at all levels. A platform that has a
654        limit on the number of paths in a pathlist for any given leaf may
655        choose to reduce the number paths using methods that are beyond the
656        scope of this document.

[minor] Expand UCMP.
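
[] For what it is worth, this is how I read the flattening of one
level, as a small sketch.  The function name and the tuple layout are
my own assumptions; the values are the ones used for VPN-IP2 in the
Section 5.2 example.  Each path is replaced by the paths of its
parent, the child's path-index is kept on every replacement entry,
and the parent's labels are collected by position into an
OutLabel-list associated with the flattened pathlist.

```python
def flatten(pathlist, parents):
    """Flatten a pathlist by one level.

    pathlist: list of (path_index, next_hop) entries of the child leaf.
    parents:  next_hop -> (parent_paths, parent_out_labels), where
              parent_paths is a list of (path_index, next_hop).
    Returns (flattened_pathlist, out_label_list_for_flattened_pathlist).
    """
    flat, labels = [], []
    for path_index, next_hop in pathlist:
        parent_paths, parent_labels = parents[next_hop]
        for parent_index, parent_next_hop in parent_paths:
            # Keep the child's path-index so the leaf's own OutLabel-List
            # (e.g. the VPN labels) is still indexed correctly.
            flat.append((path_index, parent_next_hop))
            # The parent's label lands at the same position as the new entry.
            labels.append(parent_labels[parent_index])
    return flat, labels


vpn_ip2 = [(0, "ePE2"), (1, "ePE3")]
parents = {
    "ePE2": ([(0, "ASBR11"), (1, "ASBR12")],
             ["LASBR112(ePE2)", "LASBR122(ePE2)"]),
    "ePE3": ([(0, "ASBR13")], ["LASBR13(ePE3)"]),
}
flat, labels = flatten(vpn_ip2, parents)
# flat   == [(0, "ASBR11"), (0, "ASBR12"), (1, "ASBR13")]
# labels == ["LASBR112(ePE2)", "LASBR122(ePE2)", "LASBR13(ePE3)"]
```

The three entries with path-index 0, 0, 1 are what Figure 6 shows for
VPN-IP2.  If that reading is right, stating it this compactly in the
text would help.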



...
663        Because a flattened pathlist may have an associated OutLabel-list
664        the forwarding behavior has to be slightly modified. The
665        modification is done by adding the following step right after step 4
666        in Section 4.

668        5. If there is an OutLabel-list associated with the pathlist, then
669           if the path "Pi" is chosen by the hashing algorithm, retrieve the
670           label at location "i" in that OutLabel-list and apply the label
671           action of that label on the packet.

[major] Step 4 says this:

   4. If the prefix is labeled, use the "path-index" "j" to retrieve
      the jth label "Lj" stored the jth entry in the OutLabel-List and
      apply the label action of the label on the packet...

Besides using "j" instead of "i" as the path-index, both seem to say
the same thing.  Am I missing something?



...
676     5.2. Example: Flattening a forwarding chain.

678        This example uses a case of inter-AS option C [9] where there are 3
679        levels of hierarchy. Figure 4 illustrates the sample topology. To
680        force 3 levels of hierarchy, the ASBRs[9] on the ingress domain
681        (domain 1) advertise the core routers of the egress domain (domain
682        2) to the ingress PE (iPE) via BGP-LU [3] instead of redistributing
683        them into the IGP of domain 1. The end result is that the ingress PE
684        (iPE) has 2 levels of recursion for the VPN prefixes VPN-IP1 and
685        VPN-IP2.

[] Suggestion:

OLD>
   To force 3 levels of hierarchy, the ASBRs[9] on the ingress
   domain (domain 1) advertise the core routers of the egress
   domain (domain 2) to the ingress PE (iPE) via BGP-LU [3]
   instead of redistributing them into the IGP of domain 1.

NEW>
   The ASBRs on the ingress domain (Domain 1) use BGP to advertise
   the core routers (ASBRs and ePEs) of the egress domain (Domain
   2) to the iPE.


[minor] Expand ASBR on first use.  No need to add a reference.



687            Domain 1                 Domain 2
688        +-------------+          +-------------+
689        |             |          |             |
690        | LDP/SR Core |          | LDP/SR core |
691        |             |          |             |
692        |     (192.0.2.4)        |             |
693        |         ASBR11-------ASBR21........ePE1(192.0.2.1)
694        |             | \      / |   .      .  |\
695        |             |  \    /  |    .    .   | \
696        |             |   \  /   |     .  .    |  \
697        |             |    \/    |      ..     |   \VPN-IP1(198.51.100.0/24)
698        |             |    /\    |      . .    |   /VRF "Blue" ASn: 65000
699        |             |   /  \   |     .   .   |  /
700        |             |  /    \  |    .     .  | /
701        |             | /      \ |   .       . |/
702        iPE        ASBR12-------ASBR22........ePE2 (192.0.2.2)
703        |     (192.0.2.5)        |             |\
704        |             |          |             | \
705        |             |          |             |  \
706        |             |          |             |   \VRF "Blue" ASn: 65000
707        |             |          |             |   /VPN-IP2(203.0.113.0/24)
708        |             |          |             |  /
709        |             |          |             | /
710        |             |          |             |/
711        |         ASBR13-------ASBR23........ePE3(192.0.2.3)
712        |     (192.0.2.6)        |             |
713        |             |          |             |
714        |             |          |             |
715        +-------------+          +-------------+
716         <===========  <=========  <============
717        Advertise ePEx  Advertise   Redistribute
718        Using iBGP-LU   ePEx Using    IGP into
719                         eBGP-LU        BGP

721                    Figure 4 : Sample 3-level hierarchy topology

[minor] s/ASn/ASN/g


[major] "Redistribute IGP into BGP"  ??

I'm sure you don't mean that, but probably more something like
"redistribute the ePEx routes only"...



723        We will make the following assumptions about connectivity

[nit] s/connectivity/connectivity:



725        o  In "domain 2", both ASBR21 and ASBR22 can reach both ePE1 and
726           ePE2 using the same distance.

[nit] The figure capitalizes "Domain", but the text doesn't.  Please
be consistent!


[minor] By "distance" I assume you mean "IGP metric".  Please use that
term instead.   Apply to other occurrences.



...
733        We will make the following assumptions about the labels

[nit] s/labels/labels:



...
740        o  The labels advertised by ASBR11 to iPE using BGP-LU [3] for the
741           egress PEs ePE1 and ePE2 are LASBR111(ePE1) and LASBR112(ePE2),
742           respectively.

[] Why change the notation of the labels?  This new notation
("LASBR111(ePE1)") just makes the explanation more complex. :-(



...
764              65000: 198.51.100.0/24
765                 via ePE1 (192.0.2.1), VPN Label: VPN-L11
766                 via ePE2 (192.0.2.2), VPN Label: VPN-L21
767              65000: 203.0.113.0/24
768                 via ePE2 (192.0.2.2), VPN Label: VPN-L22
769                 via ePE3 (192.0.2.3), VPN Label: VPN-L32

771              192.0.2.1/32 (ePE1)
772                 Via ASBR11, BGP-LU Label: LASBR111(ePE1)
773                 Via ASBR12, BGP-LU Label: LASBR121(ePE1)
774              192.0.2.2/32 (ePE2)
775                 Via ASBR11, BGP-LU Label: LASBR112(ePE2)
776                 Via ASBR12, BGP-LU Label: LASBR122(ePE2)
777              192.0.2.3/32 (ePE3)
778                 Via ASBR13, BGP-LU Label: LASBR13(ePE3)

780              192.0.2.4/32 (ASBR11)
781                 via Core, Label:  IGP-L11
782              192.0.2.5/32 (ASBR12)
783                 via Core, Label:  IGP-L12
784              192.0.2.6/32 (ASBR13)
785                 via Core, Label:  IGP-L13

[nit] s/Via/via/g


[minor] Only some of the addresses are annotated ("192.0.2.1/32
(ePE1)").  This is helpful, but inconsistent -- both within this
example and with previous ones.



...
797        o  The index inside the pathlist entry indicates the label that will
798           be picked from the OutLabel-List associated with the child leaf
799           if that path is chosen by the forwarding engine hashing function.

[] s/index/path-index



801        OutLabel-List                                      OutLabel-List
802          For VPN-IP1                                         For VPN-IP2
803        +------------+    +--------+           +-------+   +------------+
804        |  VPN-L11   |<---| VPN-IP1|           |VPN-IP2|-->|  VPN-L22   |
805        +------------+    +---+----+           +---+---+   +------------+
806        |  VPN-L21   |        |                    |       |  VPN-L32   |
807        +------------+        |                    |       +------------+
808                              |                    |
809                              V                    V
810                         +---+---+            +---+---+
811                         | 0 | 1 |            | 0 | 1 |
812                         +-|-+-\-+            +-/-+-\-+
813                           |    \              /     \
814                           |     \            /       \
815                           |      \          /         \
816                           |       \        /           \
817                           v        \      /             \
818                      +-----+       +-----+             +-----+
819                 +----+ ePE1|       |ePE2 +-----+       | ePE3+-----+
820                 |    +--+--+       +-----+     |       +--+--+     |
821                 v       |            /         v          |        v
822        +--------------+ |           /   +--------------+  | +-------------+
823        |LASBR111(ePE1)| |          /    |LASBR112(ePE2)|  | |LASBR13(ePE3)|
824        +--------------+ |         /     +--------------+  | +-------------+
825        |LASBR121(ePE1)| |        /      |LASBR122(ePE2)|  | OutLabel-List
826        +--------------+ |       /       +--------------+  |    For ePE3
827        OutLabel-List    |      /        OutLabel-List     |
828            For ePE1     |     /           For ePE2        |
829                         |    /                            |
830                         |   /                             |
831                         |  /                              |
832                         v /                               v
833                     +---+---+  Shared Pathlist          +---+  Pathlist
834                     | 0 | 1 | For ePE1 and ePE2         | 0 |  For ePE3
835                     +-|-+-\-+                           +-|-+
836                       |    \                              |
837                       |     \                             |
838                       |      \                            |
839                       |       \                           |
840                       v        \                          v
841                    +------+    +------+               +------+
842                +---+ASBR11|    |ASBR12+--+            |ASBR13+---+
843                |   +------+    +------+  |            +------+   |
844                v                         v                       v
845           +-------+                  +-------+              +-------+
846           |IGP-L11|                  |IGP-L12|              |IGP-L13|
847           +-------+                  +-------+              +-------+

849            Figure 5 : Forwarding Chain for hardware supporting 3 Levels

[] The OutLabel-Lists at the bottom are not tagged as such.


[] Some down arrows ("v") are missing -- on the non-vertical lines.



851        Now suppose the hardware on iPE (the ingress PE) supports 2 levels
852        of hierarchy only. In that case, the 3-levels forwarding chain in
853        Figure 5 needs to be "flattened" into 2 levels only.

855        OutLabel-List                                  OutLabel-List
856          For VPN-IP1                                    For VPN-IP2
857        +------------+    +-------+      +-------+     +------------+
858        |  VPN-L11   |<---|VPN-IP1|      | VPN-IP2|--->|  VPN-L22   |
859        +------------+    +---+---+      +---+---+     +------------+
860        |  VPN-L21   |        |              |         |  VPN-L32   |
861        +------------+        |              |         +------------+
862                              |              |
863                              |              |
864                              |              |
865               Flattened      |              |  Flattened
866               pathlist       V              V   pathlist
867                         +===+===+        +===+===+===+     +==============+
868                +--------+ 0 | 1 |        | 0 | 0 | 1 +---->|LASBR112(ePE2)|
869                |        +=|=+=\=+        +=/=+=/=+=\=+     +==============+
870                v          |    \          /   /     \      |LASBR122(ePE2)|
871         +==============+  |     \  +-----+   /       \     +==============+
872         |LASBR111(ePE1)|  |      \/         /         \    |LASBR13(ePE3) |
873         +==============+  |      /\        /           \   +==============+
874         |LASBR121(ePE1)|  |     /  \      /             \
875         +==============+  |    /    \    /               \
876                           |   /      \  /                 \
877                           |  /       +  +                  \
878                           |  +       |  |                   \
879                           |  |       |  |                    \
880                           v  v       v  v                     \
881                         +------+    +------+              +------+
882                    +----|ASBR11|    |ASBR12+---+          |ASBR13+---+
883                    |    +------+    +------+   |          +------+   |
884                    v                           v                     v
885                +-------+                  +-------+              +-------+
886                |IGP-L11|                  |IGP-L12|              |IGP-L13|
887                +-------+                  +-------+              +-------+

[] Same comments as above.



889          Figure 6 : Flattening 3 levels to 2 levels of Hierarchy on iPE

891        Figure 6 represents one way to "flatten" a 3 levels hierarchy into
892        two levels. There are few important points:

[nit] s/are few/are a few



...
908        o  Let's take a look at the flattened pathlist used by the prefix
909           "VPN-IP2", The pathlist associated with the prefix "VPN-IP2" has
910           three entries.

[nit] s/"VPN-IP2", The/"VPN-IP2". The



912            o The first and second entry have index "0". This is because
913              both entries correspond to ePE2. Thus when hashing performed
914              by the forwarding engine results in using first or the second
915              entry in the pathlist, the forwarding engine will pick the
916              correct VPN label "VPN-L22", which is the label advertised by
917              ePE2 for the prefix "VPN-IP2".

[nit] s/using first/using the first



...
952        o  So the packet is forwarded towards the ASBR "ASBR12" and the IGP
953           label at the top will be "L12".

[minor] The figure calls the label "IGP-L12", not just "L12".  There
are other instances of this below.



...
982     6. Forwarding Chain Adjustment at a Failure

984        The hierarchical and shared structure of the forwarding chain
985        explained in the previous section allows modifying a small number of
986        forwarding chain objects to re-route traffic to a pre-calculated
987        equal-cost or backup path without the need to modify the possibly
988        very large number of BGP prefixes. In this section, we go over
989        various core and edge failure scenarios to illustrate how FIB
990        manager can utilize the forwarding chain structure to achieve BGP
991        prefix independent convergence.

[nit] s/how FIB manager/how the FIB manager



993     6.1. BGP-PIC core

[minor] The meaning of "core" and "edge" should be explained somewhere.



...
1001       When a remote link or node fails, IGP on the ingress PE receives
1002       advertisement indicating a topology change so IGP re-converges to
1003       either find a new next-hop and/or outgoing interface or remove the
1004       path completely from the IGP prefix used to resolve BGP next-hops.
1005       IGP and/or LDP download the modified IGP leaves with modified
1006       outgoing labels for labeled core.

[nit] s/IGP/the IGP/g


[nit] s/receives advertisement/receives an advertisement


[nit] s/for labeled core/for the labeled core


[] "download"

This is the second time that the "download" concept is used without
explanation.  See my comment in §3.1.



1008       When a local link fails, FIB manager detects the failure almost
1009       immediately. The FIB manager marks the impacted path(s) as unusable
1010       so that only useable paths are used to forward packets. Hence only
1011       IGP pathlists with paths using the failed local link need to be
1012       modified. All other pathlists are not impacted. Note that in this
1013       particular case there is actually no need even to backwalk to IGP
1014       leaves to adjust the OutLabel-Lists because FIB can rely on the
1015       path-index stored in the useable paths in the pathlist to pick the
1016       right label.

[nit] s/is actually no need even to/is no need to


[nit] s/to IGP leaves/to the IGP leaves


[major] The term "backwalk" is introduced here, but not explained.
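
[] My understanding of the local-link case, as a sketch: only the
shared IGP pathlist is touched, and the path-index stored in the
surviving path still selects the right label without visiting the
leaves.  The names (IgpPathlist, local_link_down) and the dictionary
layout are hypothetical; please correct me if the intended behavior
is different.

```python
class IgpPathlist:
    """Hypothetical shared IGP pathlist; each path keeps its own path-index."""

    def __init__(self, paths):
        # paths: list of (path_index, interface)
        self.paths = [
            {"path_index": index, "interface": interface, "usable": True}
            for index, interface in paths
        ]

    def usable_paths(self):
        return [p for p in self.paths if p["usable"]]


def local_link_down(pathlists, interface):
    """Mark paths over the failed interface unusable.

    Returns the number of pathlists that had to be modified; leaves and
    their OutLabel-Lists are never visited.
    """
    touched = 0
    for pathlist in pathlists:
        hit = [p for p in pathlist.paths if p["interface"] == interface]
        for path in hit:
            path["usable"] = False
        touched += bool(hit)
    return touched


shared = IgpPathlist([(0, "I1"), (1, "I2")])
out_labels = ["IGP-L11", "IGP-L12"]          # OutLabel-List of the IGP leaf
modified = local_link_down([shared], "I1")
survivor = shared.usable_paths()[0]
label = out_labels[survivor["path_index"]]   # still indexed by path-index
```

The number of objects modified depends on the number of pathlists
using the failed link, not on the number of prefixes that depend on
them.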



1018       It is noteworthy to mention that because FIB manager modifies the
1019       forwarding chain starting from the IGP leaves only. BGP pathlists
1020       and leaves are not modified. Hence traffic restoration occurs within
1021       the time frame of IGP convergence, and, for local link failure,
1022       assuming a backup path has been precomputed, within the timeframe of
1023       local detection (e.g. 50ms). Examples of solutions that pre-
1024       computing backup paths are IP FRR [16] remote LFA [17], Ti-LFA [15]
1025       and MRT [18] or eBGP path having a backup path [11].

[major] The case in the previous paragraph deals with a local link
failure that results in the routes being "unusable", but this
paragraph mentions restoration "assuming a backup path has been
precomputed" -- the restoration timeframe also applies to the case
where there is no backup path (as explained above).

Note that the next paragraph also talks about a backup route and seems
to ignore the "unusable" case above.


[nit] s/solutions that pre- computing/solutions that can pre-compute


[nit] s/Ti-LFA/TI-LFA


[minor] Expand FRR, LFA, TI-LFA, and MRT on first use.


[nit] s/IP FRR [16] remote LFA...[18] or eBGP/IP FRR [16], remote
LFA...[18], or eBGP


[minor] "eBGP path having a backup path [11]"

I don't think that's the right reference.



1027       Let's apply the procedure mentioned in this subsection to the
1028       forwarding chain depicted in Figure 2. Suppose a remote link failure
1029       occurs and impacts the first ECMP IGP path to the remote BGP next-
1030       hop. Upon IGP convergence, the IGP pathlist used by the BGP next-hop
1031       is updated to reflect the new topology (one path instead of two). As
1032       soon as the IGP convergence is effective for the BGP next-hop entry,
1033       the new forwarding state is immediately available to all dependent
1034       BGP prefixes. The same behavior would occur if the failure was local
1035       such as an interface going down. As soon as the IGP convergence is
1036       complete for the BGP next-hop IGP route, all its BGP depending
1037       routes benefit from the new path. In fact, upon local failure, if
1038       LFA protection is enabled for the IGP route to the BGP next-hop and
1039       a backup path was pre-computed and installed in the pathlist, upon
1040       the local interface failure, the LFA backup path is immediately
1041       activated (e.g. sub-50msec) and thus protection benefits all the
1042       depending BGP traffic through the hierarchical forwarding dependency
1043       between the routes.

[minor] "Upon IGP convergence... As soon as the IGP convergence is
effective for the BGP next-hop entry,"

There's some redundancy in the text.  Also, by using different text,
you're implying that there is a difference between "IGP convergence"
and "IGP convergence...effective for the BGP next-hop entry".


[major] "The same behavior would occur if the failure was local such
as an interface going down. As soon as the IGP convergence..."

According to the text above (2 or 3 paragraphs before), IGP
convergence is not necessary in the local failure case.
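
[] To make my reading of the local failure case concrete, here is a
minimal Python sketch.  The class and function names, the link names,
and the prefixes are mine (hypothetical), not the draft's; it only
illustrates the point that marking a path unusable in a shared IGP
pathlist touches one object, no matter how many BGP prefixes resolve
through it, and without waiting for the IGP.

```python
class Path:
    """One IGP path: an outgoing link plus a usable flag (hypothetical model)."""
    def __init__(self, link):
        self.link = link
        self.usable = True


class PathList:
    """An IGP pathlist shared by every leaf that resolves through it."""
    def __init__(self, links):
        self.paths = [Path(link) for link in links]

    def usable_links(self):
        return [p.link for p in self.paths if p.usable]


def fail_local_link(pathlists, link):
    """Mark paths over the failed link unusable; return how many pathlists changed."""
    touched = 0
    for pathlist in pathlists:
        hit = [p for p in pathlist.paths if p.link == link and p.usable]
        for path in hit:
            path.usable = False
        touched += bool(hit)
    return touched


# Two ECMP paths towards the BGP next-hop (assumed topology).
igp_pathlist = PathList(["link-1", "link-2"])
other_pathlist = PathList(["link-3"])

# Many BGP prefixes resolve through the same shared IGP pathlist.
bgp_prefixes = {f"10.{k // 256}.{k % 256}.0/24": igp_pathlist for k in range(1000)}

touched = fail_local_link([igp_pathlist, other_pathlist], "link-1")
remaining = {tuple(pl.usable_links()) for pl in bgp_prefixes.values()}
```

In this sketch only one pathlist is modified and all 1000 prefixes
end up on the remaining path.  If the text intends something different
for the local case, please spell it out.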



...
1050    6.2.1. Adjusting forwarding Chain in egress node failure

[minor] Most of the text talks about an "egress node".  I assume that
is the same as an "edge node" -- is it?  If so, please be consistent
and/or explain somewhere that you're referring to the same thing.


1052       When an edge node fails, IGP on neighboring core nodes send route
1053       updates indicating that the edge node is no longer reachable. IGP
1054       running on the iPE instructs FIB to remove the IP and label leaves
1055       corresponding to the failed edge node from FIB. So FIB manager
1056       performs the following steps:

[major] "IGP on neighboring core nodes send route updates indicating
that the edge node is no longer reachable"

§6.1 says that "Node failures are treated as link failures."   Is that
the same in this section?

The neighbor nodes can't detect that the node is down -- they can only
detect the link being down.  The rest of the explanation assumes that
there was a single link connecting the egress node to the core, or
that all the links were reported as down.  Neither case is explained.

If there is more than one link there shouldn't be a failure in
reaching the next-hop -- except for the change in the local router(s),
similar to a core failure.
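
To illustrate the multi-link point, here is a small Python sketch
using a topology that I made up (iPE, two core nodes P1 and P2, and an
ePE attached to both); none of these names or links come from the
draft.  It only checks reachability after links are reported down.

```python
from collections import deque


def reachable(links, src, dst):
    """Breadth-first search over undirected links; True if dst is reachable."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Assumed topology: the egress PE is dual-attached to the core.
links = {("iPE", "P1"), ("iPE", "P2"), ("P1", "ePE"), ("P2", "ePE")}

one_link_down = links - {("P1", "ePE")}
all_links_down = links - {("P1", "ePE"), ("P2", "ePE")}

still_reachable = reachable(one_link_down, "iPE", "ePE")
node_gone = reachable(all_links_down, "iPE", "ePE")
```

With one link down the ePE is still reachable; it only disappears
once every attached link has been reported down, which is the case the
text needs to describe explicitly.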



...
1073    6.2.2. Adjusting Forwarding Chain on PE-CE link Failure
...
1082       In the first case, the rest of iBGP peers will remain unaware of the
1083       link failure and will continue to forward traffic to the edge node
1084       until the edge node attached to the failed link withdraws the BGP
1085       prefixes. If the destination prefixes are multi-homed to another
1086       iBGP peer, say ePE2, then FIB manager on the edge router detecting
1087       the link failure applies the following steps (see Figure 3):

[minor] Figure 3 is an example -- it does show a sample chain.  Figure
4 may be used as a reference for the type of backup described here.



...
1100           o The label entry in OutLabel-List corresponding to the
1101              internal path to backup egress PE has swap action to the
1102              label advertised by backup egress PE.

[nit] s/has swap action/has a swap action


[nit] s/by backup/by the backup



1104           o For an arriving label packet (e.g. VPN), the top label is
1105              swapped with the label advertised by backup egress PE and the
1106              packet is sent towards that backup egress PE.

[nit] s/label packet/labeled packet


[major] Even though you're expanding on an example, it is important
for the text to be precise.

- The label mentioned is "e.g. VPN", but you've been using more
specific labels -- from figure 3, "VPN-L21"...

- The OutLabel-Lists in Figure 3 show both "VPN-L21 (swap)" and
"VPN-L21 (push)".  The same label is used, only the swap action is
mentioned here.  What am I missing?

- As I mentioned before, the diagram in Figure 3 is not clear.



1108       o  For unlabeled traffic, packets are simply redirected towards
1109          backup egress PE.

[nit] s/backup/the backup


[major] "packets are simply redirected towards backup egress PE"

How?  The backup ePE is presumably not connected (which is why a label
was needed)...??



1111       In the second case where the edge router uses the IP address of the
1112       failed link as the BGP next-hop, the edge router will still perform
1113       the previous steps. But, unlike the case of next-hop self, IGP on
1114       failed edge node informs the rest of the iBGP peers that IP address
1115       of the failed link is no longer reachable. Hence the FIB manager on
1116       iBGP peers will delete the IGP leaf corresponding to the IP prefix
1117       of the failed link. The behavior of the iBGP peers will be identical
1118       to the case of edge node failure outlined in Section 6.2.1.

[nit] s/IGP on failed edge node/the IGP on the failed edge node


[nit] s/that IP address/that the IP address



1120       It is noteworthy to mention that because the edge link failure is
1121       local to the edge router, sub-50 msec convergence can be achieved as
1122       described in [11].

[] But only at the edge router -- not in other places of the network
where the IGP has to converge.

Please keep the numbers in the Appendix.  IOW, there's no need to
claim this performance here.



1124       Let's try to apply the case of next-hop self to the forwarding chain
1125       depicted in Figure 3. After failure of the link between ePE1 and CE,
1126       the forwarding engine will route traffic arriving from the core
1127       towards VPN-NH2 with path-index=1. A packet arriving from the core
1128       will contain the label VPN-L11 at top. The label VPN-L11 is swapped
1129       with the label VPN-L21 and the packet is forwarded towards ePE2.

[minor] This paragraph is out of place: you already tried to explain
this case a few paragraphs above.  Please move the text and
deduplicate.



1131    6.3. Handling Failures for Flattened Forwarding Chains

1133       As explained in the in Section 5 if the number of hierarchy levels
1134       of a platform cannot support the native number of hierarchy levels
1135       of a recursive forwarding chain, the instantiated forwarding chain
1136       is constructed by flattening two or more levels. Hence a 3 levels
1137       chain in Figure 5 is flattened into the 2 levels chain in Figure 6.

[nit] s/a 3 levels/the 3-level


[nit] s/2 levels/2-level



1139       While reducing the benefits of BGP-PIC, flattening one hierarchy
1140       into a shallower hierarchy does not always result in a complete loss
1141       of the benefits of the BGP-PIC. To illustrate this fact suppose
1142       ASBR12 is no longer reachable in domain 1. If the platform supports
1143       the full hierarchy depth, the forwarding chain is the one depicted
1144       in Figure 5 and hence the FIB manager needs to backwalk one level to
1145       the pathlist shared by "ePE1" and "ePE2" and adjust it. If the
1146       platform supports 2 levels of hierarchy, then a useable forwarding
1147       chain is the one depicted in Figure 6. In that case, if ASBR12 is no
1148       longer reachable, the FIB manager has to backwalk to the two
1149       flattened pathlists and updates both of them.

1151       The main observation is that the loss of convergence speed due to
1152       the loss of hierarchy depth depends on the structure of the
1153       forwarding chain itself. To illustrate this fact, let's take two
1154       extremes. Suppose the forwarding objects in level i+1 depend on the
1155       forwarding objects in level i. If every object on level i+1 depends
1156       on a separate object in level i, then flattening level i into level
1157       i+1 will not result in loss of convergence speed. Now let's take the
1158       other extreme. Suppose "n" objects in level i+1 depend on 1 object
1159       in level i. Now suppose FIB flattens level i into level i+1. If a
1160       topology change results in modifying the single object in level i,
1161       then FIB has to backwalk and modify "n" objects in the flattened
1162       level, thereby losing all the benefit of BGP-PIC. Experience shows
1163       that flattening forwarding chains usually results in moderate loss
1164       of BGP-PIC benefits. Further analysis is needed to corroborate and
1165       quantify this statement.

[major] "The main observation is that the loss of convergence speed
due to the loss of hierarchy depth... Further analysis is needed to
corroborate and quantify this statement."

The explanation above sounds plausible.  However, if you really don't
know, and there is a dependency on the structure, and it depends on
the failure, and there's no alternative (if the platform supports fewer
levels)...  Why even include this information?  It feels like the
equivalent of "hand waving". :-(
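
For what it's worth, the two extremes in the text can be stated
precisely.  Here is a minimal Python sketch; the function name and the
object names are hypothetical (mine, not the draft's), and it counts
only the objects the FIB has to rewrite, nothing else.

```python
def updates_needed(parent_of, changed_parent, flattened):
    """Count forwarding objects rewritten when one level-i object changes.

    parent_of maps each level-(i+1) object to the level-i object it
    depends on.  With the hierarchy intact, only the level-i object is
    rewritten.  After flattening, every level-(i+1) object that embedded
    a copy of it must be rewritten.
    """
    if not flattened:
        return 1
    return sum(1 for parent in parent_of.values() if parent == changed_parent)


n = 4

# Extreme 1: every level-(i+1) object depends on a separate level-i object.
one_to_one = {f"child{k}": f"parent{k}" for k in range(n)}

# Extreme 2: n level-(i+1) objects depend on a single level-i object.
n_to_one = {f"child{k}": "parent0" for k in range(n)}

cost_separate = updates_needed(one_to_one, "parent0", flattened=True)
cost_shared = updates_needed(n_to_one, "parent0", flattened=True)
cost_hierarchical = updates_needed(n_to_one, "parent0", flattened=False)
```

In the first extreme flattening costs nothing extra (one update either
way); in the second it costs n updates instead of one.  Anything in
between depends on the fan-out of the chain, which is exactly what the
text does not quantify.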



1167    7. Properties

[minor] This section is really a summary.  I wonder if it would serve
a better purpose at the start of the document as a "road map".



1169    7.1. Coverage

1171       All the possible failures, except CE node failure, are covered,
1172       whether they impact a local or remote IGP path or a local or remote
1173       BGP next-hop as described in Section 6. This section provides
1174       details for each failure and how the hierarchical and shared FIB
1175       structure proposed in this document allows recovery that does not
1176       depend on number of BGP prefixes.

[major] "except CE node failure"

Isn't that covered in §6.2.2?



1178    7.1.1. A remote failure on the path to a BGP next-hop

1180       Upon IGP convergence, the IGP leaf for the BGP next-hop is updated
1181       upon IGP convergence and all the BGP depending routes leverage the
1182       new IGP forwarding state immediately. Details of this behavior can
1183       be found in Section 6.1.

[minor] Redundant text: "Upon IGP convergence, the IGP leaf for the
BGP next-hop is updated upon IGP convergence..."



1185       This BGP resiliency property only depends on IGP convergence and is
1186       independent of the number of BGP prefixes impacted.

[] "BGP resiliency property"

The rest of the text (and the name of the document!) refers to
convergence.  It isn't until this section that you change the
terminology.  Please be consistent.



1188    7.1.2. A local failure on the path to a BGP next-hop

1190       Upon LFA protection, the IGP leaf for the BGP next-hop is updated to
1191       use the precomputed LFA backup path and all the BGP depending routes
1192       leverage this LFA protection. Details of this behavior can be found
1193       in Section 6.1.

[minor] Redundant text: LFA is mentioned 3 times in the first sentence.


[major] What about the case where an LFA doesn't exist?



...
1238    7.3. Automated

1240       The BGP-PIC solution does not require any operator involvement. The
1241       process is entirely automated as part of the FIB implementation.

[] Ok -- it is not so much an automation as it is an implementation choice.



1243       The salient points enabling this automation are:

[minor] I'm not sure if you're saying that the list below is what
enables the "automation", or if the "automation" results in that.



1245       o  Extension of the BGP Best Path to compute more than one primary
1246          ([12]and [13]) or backup BGP next-hop ([7] and [14]).

[major] This functionality is available regardless of the structure of the FIB.



1248       o  Sharing of BGP Path-list across BGP destinations with same
1249          primary and backup BGP next-hop.

[nit] s/same/the same



1251       o  Hierarchical indirection and dependency between BGP pathlist and
1252          IGP pathlist.

[] The last two bullets are a characteristic of the implementation.



1254    7.4. Incremental Deployment

1256       As soon as one router supports BGP-PIC solution, it benefits from
1257       all its benefits (most notably convergence that does not depend in
1258       the number of prefixes) without any requirement for other routers to
1259       support BGP-PIC.

[nit] Redundant text: "it benefits from all its benefits"


[major] The assertion above is true.  However, (in the extreme) a
single router supporting BGP-PIC doesn't necessarily translate into
better overall convergence in the network (it depends on where the
router is, the failure, etc.).  Also, having a mix of routers could
result in transient micro loops as the speed of convergence is not the
same.



1261    8. Security Considerations

1263       The behavior described in this document is internal functionality
1264       to a router that result in significant improvement to convergence
1265       time as well as reduction in CPU and memory used by FIB while not
1266       showing change in basic routing and forwarding functionality. As
1267       such no additional security risk is introduced by using the
1268       mechanisms proposed in this document.

[] I can't think of anything else to say here, but you did not show
(nor do I need you to) "significant...reduction in CPU and memory
used".  It sounds more like a marketing statement. :-(



1270    9. IANA Considerations

1272       No requirements for IANA

[major] s/.../This document has no IANA actions.



1274    10. Conclusions

[] This section is not needed; it contains redundant information.
Please remove it.



...
1283    11. References

[major] The RFC Editor requires that the DOI be listed for all
references [1].  Please update the references -- here's an example of
what they should look like [2]:

   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A Border
              Gateway Protocol 4 (BGP-4)", RFC 4271, DOI 10.17487/RFC4271,
              January 2006, <https://www.rfc-editor.org/info/rfc4271>.

[1] https://www.rfc-editor.org/styleguide/part2/
[2] https://www.rfc-editor.org/refs/ref4271.txt



1285    11.1. Normative References
...
1293       [3]   E. Rosen, " Carrying Label Information in BGP-4", RFC 8277,
1294             October 2017

[minor] This reference can be Informative.



1296       [4]   Andersson, L., Minei, I., and B. Thomas, "LDP Specification",
1297             RFC 5036, October 2007

[minor] This reference can be Informative.



1299       [5]   A. Bashandy, C. Filsfils, S. Previdi, B. Decraene, S.
1300             Litkowski, M. Horneffer, R. Shakir, "Segment Routing with MPLS
1301             data plane", RFC 8660, December 2019

[minor] This reference can be Informative.



...
1377    Appendix A.                 Perspective
...
1392       o  BGP Convergence per BGP destination ~ 200usec conservative,

1394                                              ~ 100usec optimistic

[minor] "BGP Convergence"

The explanation below talks about recalculating the best route --
which is different than convergence (which includes sending BGP
Updates, etc.).

[EoR -18]
