RE: Comments on draft-kurapati-dynamicrp-bgpmvpn-00.txt

Kurt Windisch Thu, 25 Jul 2013 18:06:46 -0700

Eric and Stig, 

Sorry for the delay in responding to your excellent feedback. All of 
your comments are much appreciated and I will try to address them as 
best I can below. There are very good points here that can be 
addressed in a new revision of the draft.

Eric wrote:

> We are wondering why the draft proposes to use a new SAFI, rather than
  reusing the MCAST-VPN SAFI. (…)

<Kurt>The reason for choosing a new SAFI is so that support for the 
BSR routes can be negotiated per SAFI. If we were to find a way to leverage 
end-of-rib signaling in routine sets of NLRI updates, a separate SAFI 
would have other benefits (but see the end-of-rib comments below).

Alternatively, it may be possible to simply rely on implementations 
to properly discard NLRIs of unknown types within the MCAST-VPN SAFI.

> The draft seems to allow an AFI of "IPv6" to be used together with an IPv4
  BSR address, or vice versa. (…)

<Kurt>This is not the intent. The only intended use of "mixed" AFIs would 
be in the "Originating PE's IP Address", which is the PE's address 
in the service provider core and may be a different AFI than the AFI 
of the customer network for which this SAFI is advertising BSR. 
For consistency, we should adopt the convention of RFC 6515 here 
as the issues are the same. We should also explicitly require that 
the AFI of other IP addresses in the NLRIs match the AFI of the NLRI.
Current language describing the possible values of "len of BSR 
address" and other similar passages must be tightened accordingly.

> For RP addresses and Group addresses the draft proposes to use the
  "encoded" formats from the PIM spec.  These formats contain an octet that
  identifies the address family.  There should be a requirement that the
  address family as encoded in the "encoded format" be the same as the
  address family identified in the BGP Update's AFI, and that the lengths be
  appropriate for that address family.

<Kurt>Agreed.
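
To make the requirement concrete, here is a minimal sketch of the check
for an "encoded unicast address" field. It is only an illustration: the
function and constant names are hypothetical and do not come from the
draft. It assumes the PIM Encoded-Unicast layout (one octet of address
family, one octet of encoding type, then the address) and relies on the
fact that both the BGP AFI and the PIM address family octet use the IANA
Address Family Numbers (1 for IPv4, 2 for IPv6).

```python
# IANA Address Family Numbers, used both as the BGP AFI and in the
# address family octet of the PIM "encoded" formats.
AFI_IPV4 = 1
AFI_IPV6 = 2

# Address length in octets for each family.
ADDR_LEN = {AFI_IPV4: 4, AFI_IPV6: 16}


def check_encoded_unicast(bgp_afi: int, field: bytes) -> bool:
    """Return True if an Encoded-Unicast field is consistent with the AFI.

    Hypothetical helper. Assumed layout: address family (1 octet),
    encoding type (1 octet), address (4 or 16 octets).
    """
    if bgp_afi not in ADDR_LEN or len(field) < 2:
        return False
    family = field[0]
    # The family must match the Update's AFI, and the total length must
    # be the 2-octet header plus an address of that family.
    return family == bgp_afi and len(field) == 2 + ADDR_LEN[bgp_afi]
```

An Update carrying a field that fails this check would be treated as
malformed. The same pattern would apply to the encoded group format,
with its longer fixed header.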

> In the BGP Update, parsing would be simpler if the Length field that
  precedes an "encoded group format" field or an "encoded unicast address
  field" contains the length of that field, not the length of the address
  prefix that appears within the encoded format.

<Kurt>Agreed.

> The mention of the "VRF Route Import Extended Community" in section 4.1
  should say "VRF Route Import Extended Community" or "VRF Route Import IPv6
  Address Specific Extended Community", to cover the case of a SP with an
  IPv6 infrastructure.  (It also needs to be made clear that this applies to
  the NLRI of sections 4.2 and 4.3 as well.)

<Kurt>Agreed.

> What action is to be taken if a BGP Update with an MCAST-VPN-BSR NLRI is
  received, but there is no BSR-BGP Path attribute?

<Kurt>Since all BGP-BSR NLRI route types must carry the BGP-BSR Path 
attribute per the current text, I think the proper response to a
route update lacking this attribute is to declare it malformed and 
require that the NLRI be discarded without processing of its contents.

> It's hard to interpret phrases like "the group count for this NLRI is not
  set".  How does one send this attribute without "setting" all its fields?
  Does "not set" just mean "set to zero", or does it mean only that certain
  fields are irrelevant to the processing of certain NLRIs.

<Kurt>While it's true that certain fields are not relevant to all NLRIs 
(see the next comment), this language can be clarified to be "set to zero."
The next comment does a nice job of defining these behaviors.

> The draft could use a little table to show which fields affect the
  processing of which received NLRIs: 
  (…)
  Where a particular NLRI/field combination is "No", perhaps what the draft
  should say is that the field MUST be ignored when processing that type of
  NLRI.  That would allow the ignored field to carry any value, without risk
  of any interoperability problems.  If one only says "SHOULD be ignored",
  there may be interoperability problems.

<Kurt>This is correct and the draft should adopt these excellent suggestions.
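
As a rough illustration of the "MUST be ignored" rule, the table from
the original comments (reproduced in full in the quoted message below)
can be treated as a lookup that filters the attribute before any
processing. This is only a sketch; the field names (frag_tag, rp_count,
group_count) and the function name are hypothetical, not taken from the
draft.

```python
# Which BSR-BGP attribute fields affect processing of each NLRI type,
# transcribed from the table in the original comments.
RELEVANT_FIELDS = {
    "BSR Parameters":       {"frag_tag": True, "rp_count": False, "group_count": True},
    "BSM Group Parameters": {"frag_tag": True, "rp_count": True,  "group_count": False},
    "BSM RP Parameters":    {"frag_tag": True, "rp_count": False, "group_count": False},
}


def effective_fields(nlri_type: str, attr: dict) -> dict:
    """Keep only the fields that are relevant for this NLRI type.

    Fields marked "No" in the table are dropped regardless of their
    value, so a sender may put anything in them.
    """
    relevant = RELEVANT_FIELDS[nlri_type]
    return {name: value for name, value in attr.items()
            if relevant.get(name, False)}
```

For example, for a BSM RP Parameters NLRI only frag_tag would survive
the filter, whatever the two count fields contain.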

> Regarding Fragmentation Tags
  There don't seem to be any clear instructions as to when the fragmentation
  tag field of the BSR-BGP attribute of a given NLRI actually needs to be
  changed. (…) why can't fragmentation  be entirely a local matter 
  (i.e., not communicated across the net)?

<Kurt>This is a good observation and we agree that the BSR Fragment Tag is 
not needed in BGP. It should be removed and the table from the previous 
comment can also be updated accordingly.

> Constructing BSMs from the Counts
  Suppose an ingress PE receives a BSM with 15 RP mappings for a given
  group.  Then it receives another BSM with 15 RP mappings for that group,
  10 of which are the same, and 5 of which are different.

  It seems that if the egress PE receives "withdraw, update, withdraw,
  update, withdraw, update, withdraw, update, withdraw, update", it could
  generate five BSMs.  Is our understanding correct, or are we missing
  something?

<Kurt>The egress PE may receive BGP NLRI updates as described above, such that
the counts contained in the NLRIs do not allow it to determine when it has
received the full set of updates that came from the BSM on the ingress PE.

The worst case would be that each withdraw and new update arrives in a 
separate BGP UPDATE message, which, depending on the implementation, may 
trigger BSR to attempt to generate multiple BSMs to CEs. 

This is a worst-case scenario. The NLRI design attempts to account for 
the full set of updates using the various counters so that the egress PE
can know when it can send a complete BSM. But it must be acknowledged that
the update stream you describe is one in which this protocol cannot operate
as well as native BSR.

However, it's likely that the ingress PE would batch the individual NLRI 
updates into far fewer BGP UPDATE messages. 
The draft could recommend that an implementation process all BGP NLRI 
updates it has received before generating the new BSM, to minimize the 
behavior above. 

One more point is that even if the NLRI updates arrive at the egress in 
separate BGP UPDATE messages at different times, triggering separate 
attempts to send BSMs, the BSR spec requires a minimum time between BSMs 
being sent (RFC 5059 specifies BS_min_interval with a default of 10 seconds). 
Thus, it's likely that the second BSM would contain the updates to all 
15 mappings and the worst case you described is mitigated. The issue 
then is the added delay to convergence introduced by the BS_min_interval,
so it is still not ideal in comparison to native BSR.
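
The effect of the minimum interval can be sketched with a small
simulation. This is a simplified model, not the RFC 5059 state machine:
it assumes every arriving update triggers a BSM, that a BSM cannot be
sent less than min_interval seconds after the previous one, and that
updates arriving while a BSM is waiting are folded into it. The function
name is hypothetical.

```python
def bsm_send_times(update_times, min_interval=10.0):
    """Return [(send_time, updates_covered), ...] for the triggered BSMs."""
    sends = []  # each entry is [send_time, updates_covered]
    for t in sorted(update_times):
        if sends and t < sends[-1][0]:
            # A BSM is already scheduled but not yet sent: fold this
            # update into it instead of scheduling another one.
            sends[-1][1] += 1
            continue
        # Otherwise schedule a new BSM, no earlier than min_interval
        # after the previous one.
        earliest = sends[-1][0] + min_interval if sends else t
        sends.append([max(t, earliest), 1])
    return [tuple(s) for s in sends]


# Five withdraw/update pairs arriving one second apart:
print(bsm_send_times([0, 1, 2, 3, 4]))
```

Under these assumptions the five changes produce two BSMs rather than
five: one at t=0 covering the first change and one at t=10 covering the
other four, which is where the extra convergence delay comes from.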

> End of RIB 
  Given that there is almost always a route reflector between the ingress
  and egress PEs, how is the "End of RIB" marker going to be helpful in
  deciding when to originate a BSM?

<Pavan+Kurt>Thanks for pointing this out; that is correct.  Also, given that 
End of RIB implementations may differ, we will remove it in the next 
version of the draft.

> BS_Timeout
  There seems to be a problem with the following procedure from section
  6.2.2 ("Missing BSM") (…)

<Kurt>First off, I think this draft needs to clarify how it integrates with
the BSR state machine, in the same terms as RFC 5059. The precise events and
actions on the PEs need to be specified so that it's clear how BGP events
on PEs interact with the local RP-Set and PE-CE BSMs.

Regarding the "Missing BSM" behavior, we need to specify the following,
which will revise the original text and procedures:
* On the ingress PE, a timer must be run at the interval of 
  BS_Period (+ some small value to allow for jitter) such that if a BSM 
  has not been received from the local BSR when it expires, the BSR
  Parameters NLRI is withdrawn.
* On the egress PE, upon withdrawal of the BSR Parameters NLRI, the PE 
  must set its BS_Timeout to BS_Timeout - BS_Period, and BSMs must not
  be sent while the BSR Parameters NLRI is not present.

Thus, if timers are in sync and allowed jitter is small, behavior should 
mirror what would happen in native BSR from the points of view of CEs
on the egress side.
However, since we cannot rely on timers being perfectly in sync, it is
still possible for the egress site to retain the RPs from the missing 
BSM for up to one BS_Period.
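
A quick worked example of the timer arithmetic, using the RFC 5059
defaults (BS_Period = 60 seconds, BS_Timeout = 130 seconds). This is a
sketch under several assumptions that are mine and not from the draft:
the ingress PE withdraws the BSR Parameters NLRI when its timer fires,
the jitter allowance is 1 second, the last BSM was seen at the same
instant on both sides, and the function names are hypothetical.

```python
BS_PERIOD = 60.0    # seconds, RFC 5059 default
BS_TIMEOUT = 130.0  # seconds, RFC 5059 default


def ingress_withdraw_time(last_bsm, jitter_allowance=1.0):
    """When the ingress PE gives up waiting for the next BSM."""
    return last_bsm + BS_PERIOD + jitter_allowance


def egress_expiry_time(last_bsm, jitter_allowance=1.0, bgp_delay=0.0):
    """When the shortened BS_Timeout fires on the egress PE."""
    withdraw_seen = ingress_withdraw_time(last_bsm, jitter_allowance) + bgp_delay
    # On withdrawal the egress runs BS_Timeout - BS_Period (70 s here).
    return withdraw_seen + (BS_TIMEOUT - BS_PERIOD)


native = 0.0 + BS_TIMEOUT          # native BSR expiry: 130.0
via_bgp = egress_expiry_time(0.0)  # 0 + 60 + 1 + 70 = 131.0
```

With these numbers the egress side expires only the jitter allowance
(plus any BGP propagation delay) later than native BSR would, which is
the "mirror" behavior described above when timers are in sync.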

This problem needs more thought to see if we can close this gap 
in relation to the native BSR operations.

> With regard to the sentence "As soon as the Type-1 is withdrawn,
  BS_Timeout period has to be started at the egress and upon its expiry, all
  the Type-2 and Type-3 entries MUST be deleted", it doesn't seem right for
  an egress PE to remove BGP installed routes based upon the expiry of a
  local timer.

<Kurt>Yes, this should read something to the effect that when BS_Timeout
expires on the egress PE, the RPs/groups discovered from the elected BSR
should be removed from the local RP-Set. The BGP routes will be deleted
when the ingress PE withdraws them.

Thanks again for the valuable comments.
We hope to produce a revised draft in the coming weeks.

Thanks again
--Kurt

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Eric 
Rosen
Sent: Monday, June 24, 2013 2:51 PM
To: Pavan Kurapati
Cc: [email protected]; [email protected]; [email protected]
Subject: Comments on draft-kurapati-dynamicrp-bgpmvpn-00.txt

Stig Venaas and I have discussed draft-kurapati-dynamicrp-bgpmvpn-00.txt,
and together we have prepared the following set of questions and comments.

- We are wondering why the draft proposes to use a new SAFI, rather than
  reusing the MCAST-VPN SAFI.

  Using a new SAFI does provide a bit more freedom in designing the NLRI,
  but the draft sticks to the basic NLRI format of the MCAST-VPN SAFI
  anyway.  I don't think the new SAFI would be used on a BGP session unless
  the MCAST-VPN SAFI is also used on that session, so why not just use the
  MCAST-VPN SAFI for the new route types?

- The draft seems to allow an AFI of "IPv6" to be used together with an IPv4
  BSR address, or vice versa.

  When using the type 1-7 routes of the MCAST-VPN SAFI, the AFI designates
  the address family being used by the customer.  The address family being
  used by the service provider is inferred from the various length
  computations that are discussed in RFC 6515.  It seems best to stick with
  that same convention for the new BSR route types.  That would mean that
  the field "BSR Address" must be of the address family identified by the
  AFI.  The address length would then have to be appropriate for that
  address family, or the Update would be considered malformed.

  On the other hand, if it were to be decided to use a new SAFI, it might
  make more sense to dispense with the RFC 6515 hacks altogether and
  explicitly encode the address family of each address.

- For RP addresses and Group addresses the draft proposes to use the
  "encoded" formats from the PIM spec.  These formats contain an octet that
  identifies the address family.  There should be a requirement that the
  address family as encoded in the "encoded format" be the same as the
  address family identified in the BGP Update's AFI, and that the lengths be
  appropriate for that address family.

- In the BGP Update, parsing would be simpler if the Length field that
  precedes an "encoded group format" field or an "encoded unicast address
  field" contains the length of that field, not the length of the address
  prefix that appears within the encoded format.

- The mention of the "VRF Route Import Extended Community" in section 4.1
  should say "VRF Route Import Extended Community" or "VRF Route Import IPv6
  Address Specific Extended Community", to cover the case of a SP with an
  IPv6 infrastructure.  (It also needs to be made clear that this applies to
  the NLRI of sections 4.2 and 4.3 as well.)

- What action is to be taken if a BGP Update with an MCAST-VPN-BSR NLRI is
  received, but there is no BSR-BGP Path attribute?

- It's hard to interpret phrases like "the group count for this NLRI is not
  set".  How does one send this attribute without "setting" all its fields?
  Does "not set" just mean "set to zero", or does it mean only that certain
  fields are irrelevant to the processing of certain NLRIs.

- The draft could use a little table to show which fields affect the
  processing of which received NLRIs:

          NLRI                          FragTag      RP Count    Group Count    

          BSR Parameters                Yes             No          Yes

          BSM Group Parameters          Yes             Yes         No

          BSM RP Parameters             Yes             No          No

  I think this table corresponds to your intentions.

  Where a particular NLRI/field combination is "No", perhaps what the draft
  should say is that the field MUST be ignored when processing that type of
  NLRI.  That would allow the ignored field to carry any value, without risk
  of any interoperability problems.  If one only says "SHOULD be ignored",
  there may be interoperability problems.

- Regarding Fragmentation Tags

  There don't seem to be any clear instructions as to when the fragmentation
  tag field of the BSR-BGP attribute of a given NLRI actually needs to be
  changed.  As a result, it's difficult to figure out its uses.  If some
  customer is sending fragmented BSMs every minute, one doesn't want to have
  BGP update all its RP mappings every minute.  So just when does the
  attribute value have to change?  Hopefully not too often, or there will be
  a lot of BGP thrashing.

  It's difficult to understand why a fragmentation tag field is needed in the
  BSR-BGP attribute at all.  The Group Count and RP Count fields are really
  what control when an egress PE can send a BSM.  If an ingress PE doesn't
  advertise changes to a group's RP mappings until it has all the mappings
  for that group (which I think is required in BSR), why can't fragmentation
  be entirely a local matter (i.e., not communicated across the net)?  What
  are we missing?

- Constructing BSMs from the Counts

  Suppose an ingress PE receives a BSM with 15 RP mappings for a given
  group.  Then it receives another BSM with 15 RP mappings for that group,
  10 of which are the same, and 5 of which are different.

  It seems that if the egress PE receives "withdraw, update, withdraw,
  update, withdraw, update, withdraw, update, withdraw, update", it could
  generate five BSMs.  Is our understanding correct, or are we missing
  something?

- End of RIB

  Given that there is almost always a route reflector between the ingress
  and egress PEs, how is the "End of RIB" marker going to be helpful in
  deciding when to originate a BSM?

- BS_Timeout

  There seems to be a problem with the following procedure from section
  6.2.2 ("Missing BSM"):

       "Egress PE receiving a withdrawn "BSR Parameters" route (Type-1)
       MUST still keep the corresponding Type-2 and Type-3 entries.
       However, it MUST NOT advertise the BSM to the CE without the
       Type-1 route present.  As soon as the Type-1 is withdrawn,
       BS_Timeout period has to be started at the egress and upon its
       expiry, all the Type-2 and Type-3 entries MUST be deleted.

       Say the egress has generated BSM at t=0.  At t=1 BS_Period
       expired at ingress PE and ingress PE did not get the periodic
       BSM.  So, it withdraws type-1 (BSR Parameters).  Egress PE has
       already generated BSM just before the type-1 withdrawal was
       received.  The egress PE skips the next periodic BSM towards the
       CE.  But CE is "off" by BS_Period interval by now.  Once the
       BS_Timeout expires, egress PE removes all the type-2 and type-3
       entries.  CEs connected to egress PE will remove the same, a
       whole BS_Period later.  Hence, to avoid this issue, once the
       BS_Timeout expires, an egress PE MUST generate a new BSM towards
       CE with RP hold time set to "0" for all the type-2 and type-3
       entries.  This will make the CEs in sync with the PEs.  After
       generating the BSM, PE removes all the Type-2 and Type-3 entries
       as stated above."

   The problem is the following.  The holding times of the individual RP
   mapping entries may be longer than the BS_Timeout.  Typically if
   BS_Timeout fires, the remaining holding time of an RP mapping entry will
   be the difference between (a) its holding time as reported in the last
   received BSM and (b) BS_Timeout.  The above seems to set the RP holding
   times to zero as soon as BS_Timeout expires.  The problem with this is
   that it may cause the RP mappings to timeout before a new BSR can be
   elected.

   Perhaps the withdrawal of a BSR parameters route should trigger the
   transmission of a new BSM that doesn't set the RP-mapping holding times
   to zero, but that just reduces each RP-mapping holding time by
   BS_Timeout.  Well, that would correct the RP-mapping holding times
   downstream of an egress PE, but it would also have the side effect of
   restarting the BS_Timeout at the routers downstream of the egress PE.  So
   that doesn't seem right either.

- With regard to the sentence "As soon as the Type-1 is withdrawn,
  BS_Timeout period has to be started at the egress and upon its expiry, all
  the Type-2 and Type-3 entries MUST be deleted", it doesn't seem right for
  an egress PE to remove BGP installed routes based upon the expiry of a
  local timer.
