Eric and Stig,
Sorry for the delay in responding to your excellent feedback. All of
your comments are much appreciated, and I will try to address them as
best I can below. There are very good points here that can be
addressed in a new revision of the draft.
Eric wrote:
> We are wondering why the draft proposes to use a new SAFI, rather than
reusing the MCAST-VPN SAFI. (…)
<Kurt>The reason for choosing to use a new SAFI is for support of the
BSR routes to be negotiated by SAFI. If we were to find a way to leverage
end-of-rib signaling in routine sets of NLRI updates, a separate SAFI
would have other benefits (but see the end-of-rib comments below).
Alternately, it may be possible to simply rely on implementations
to properly discard NLRIs of unknown types within the MCAST-VPN SAFI.
> The draft seems to allow an AFI of "IPv6" to be used together with an IPv4
BSR address, or vice versa. (…)
<Kurt>This is not the intent. The only intended use of "mixed" AFIs would
be in the "Originating PE's IP Address", which is the PE's address
in the service provider core and may be a different AFI than the AFI
of the customer network for which this SAFI is advertising BSR.
For consistency, we should adopt the convention of RFC 6515 here
as the issues are the same. We should also explicitly require that
the AFI of other IP addresses in the NLRIs match the AFI of the NLRI.
Current language describing the possible values of "len of BSR
address" and other similar passages must be tightened accordingly.
> For RP addresses and Group addresses the draft proposes to use the
"encoded" formats from the PIM spec. These formats contain an octet that
identifies the address family. There should be a requirement that the
address family as encoded in the "encoded format" be the same as the
address family identified in the BGP Update's AFI, and that the lengths be
appropriate for that address family.
<Kurt>Agreed.
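A rough sketch of the check agreed to above, assuming the PIM Encoded-Unicast layout (one Addr Family octet, one Encoding Type octet, then the address) and the IANA address family numbers 1 (IPv4) and 2 (IPv6), which BGP AFIs share. The function name and error handling are illustrative only and are not taken from the draft:

```python
import ipaddress

# IANA address family numbers, used both as the BGP AFI and as the
# "Addr Family" octet of the PIM encoded formats.
AFI_IPV4 = 1
AFI_IPV6 = 2
ADDR_LEN = {AFI_IPV4: 4, AFI_IPV6: 16}  # address length in octets


def parse_encoded_unicast(afi, field):
    """Validate an Encoded-Unicast field against the AFI of the BGP Update.

    Hypothetical helper: raises ValueError where a real implementation
    would treat the Update as malformed.
    """
    if afi not in ADDR_LEN:
        raise ValueError("unsupported AFI")
    if len(field) < 2:
        raise ValueError("field too short for family and encoding octets")
    family = field[0]          # field[1] is the encoding type, not checked here
    if family != afi:
        raise ValueError("encoded address family does not match the AFI")
    addr = field[2:]
    if len(addr) != ADDR_LEN[afi]:
        raise ValueError("address length is wrong for this address family")
    return ipaddress.ip_address(addr)
```

For example, `parse_encoded_unicast(1, bytes([1, 0, 192, 0, 2, 1]))` yields the IPv4 address 192.0.2.1, while the same bytes presented with an AFI of 2 are rejected.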
> In the BGP Update, parsing would be simpler if the Length field that
precedes an "encoded group format" field or an "encoded unicast address
field" contains the length of that field, not the length of the address
prefix that appears within the encoded format.
<Kurt>Agreed.
> The mention of the "VRF Route Import Extended Community" in section 4.1
should say "VRF Route Import Extended Community" or "VRF Route Import IPv6
Address Specific Extended Community", to cover the case of a SP with an
IPv6 infrastructure. (It also needs to be made clear that this applies to
the NLRI of sections 4.2 and 4.3 as well.)
<Kurt>Agreed.
> What action is to be taken if a BGP Update with an MCAST-VPN-BSR NLRI is
received, but there is no BSR-BGP Path attribute?
<Kurt>Since all BGP-BSR NLRI route types must carry the BGP-BSR Path
attribute per the current text, I think the proper response to a
route update lacking this attribute is to declare it malformed and
require that the NLRI be discarded without processing of its contents.
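As a minimal sketch of that rule (the `Update` class and its field names are hypothetical and only stand in for whatever structure an implementation uses to hold a parsed BGP Update):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Update:
    """Hypothetical parsed BGP Update carrying BSR NLRIs."""
    nlris: List[bytes]
    bsr_attr: Optional[bytes] = None  # raw BSR path attribute, None if absent


def nlris_to_process(update: Update) -> List[bytes]:
    """Return the NLRIs that may be processed.

    If the path attribute is missing, the update is treated as malformed
    and every NLRI in it is discarded without looking at its contents.
    """
    if update.bsr_attr is None:
        return []
    return list(update.nlris)
```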
> It's hard to interpret phrases like "the group count for this NLRI is not
set". How does one send this attribute without "setting" all its fields?
Does "not set" just mean "set to zero", or does it mean only that certain
fields are irrelevant to the processing of certain NLRIs.
<Kurt>While it's true that certain fields are not relevant to all NLRIs
(see the next comment), this language can be clarified to be "set to zero."
The next comment does a nice job of defining these behaviors.
> The draft could use a little table to show which fields affect the
processing which received NLRIs:
(…)
Where a particular NLRI/field combination is "No", perhaps what the draft
should say is that the field MUST be ignored when processing that type of
NLRI. That would allow the ignored field to carry any value, without risk
of any interoperability problems. If one only says "SHOULD be ignored",
there may be interoperability problems.
<Kurt>This is correct and the draft should adopt these excellent suggestions.
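One way to picture the "MUST be ignored" rule is a lookup keyed by NLRI type, filled in from the table in the original message quoted below. This is only a sketch; the dictionary keys and field names are made up for illustration:

```python
# Which attribute fields affect processing of each NLRI type, per the
# reviewers' table. Field names are hypothetical.
RELEVANT_FIELDS = {
    "BSR Parameters": {"frag_tag", "group_count"},
    "BSM Group Parameters": {"frag_tag", "rp_count"},
    "BSM RP Parameters": {"frag_tag"},
}


def effective_fields(nlri_type, attr):
    """Keep only the fields that matter for this NLRI type.

    Fields marked "No" in the table are dropped regardless of the value
    they carry, so a sender may put anything in them.
    """
    relevant = RELEVANT_FIELDS[nlri_type]
    return {name: value for name, value in attr.items() if name in relevant}
```

With this shape, `effective_fields("BSM RP Parameters", {"frag_tag": 7, "rp_count": 99, "group_count": 42})` keeps only `frag_tag`, whatever the two counts contain.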
> Regarding Fragmentation Tags
There don't seem to be any clear instructions as to when the fragmentation
tag field of the BSR-BGP attribute of a given NLRI actually needs to be
changed. (…) why can't fragmentation be entirely a local matter
(i.e., not communicated across the net)?
<Kurt>This is a good observation and we agree that the BSR Fragment Tag is
not needed in BGP. It should be removed and the table from the previous
comment can also be updated accordingly.
> Constructing BSMs from the Counts
Suppose an ingress PE receives a BSM with 15 RP mappings for a given
group. Then it receives another BSM with 15 RP mappings for that group,
10 of which are the same, and 5 of which are different.
It seems that if the egress PE receives "withdraw, update, withdraw,
update, withdraw, update, withdraw, update, withdraw, update", it could
generate five BSMs. Is our understanding correct, or are we missing
something?
<Kurt>The egress PE may receive BGP NLRI updates as described above, such that
the counts contained in the NLRIs are not sufficient to determine that
the full set of updates derived from the BSM on the ingress PE has been
received.
The worst case would be that each withdraw and new update arrives in a
separate BGP UPDATE message, which, depending on the implementation, may
trigger BSR to attempt to generate multiple BSMs to CEs.
This is a worst-case scenario. The NLRI design attempts to account for
the full set of updates using the various counters so that the egress PE
can tell when it can send a complete BSM. But it must be acknowledged
that the update stream you describe is one in which this protocol cannot
operate as well as native BSR.
However, it's likely that the ingress PE would batch the individual NLRI
updates into far fewer BGP UPDATE messages.
The draft could recommend that an implementation process all BGP NLRI
updates it has received before generating a new BSM, to minimize the
behavior above.
One more point is that even if the NLRI updates arrive at the egress in
separate BGP UPDATE messages at different times, triggering separate
attempts to send BSMs, the BSR spec requires a minimum time between BSMs
being sent (RFC 5059 specifies BS_min_interval with a default of 10 seconds).
Thus, it's likely that the second BSM would contain the updates to all
15 mappings, and the worst case you described is mitigated. The issue
then is the added delay to convergence introduced by the BS_min_interval
so it is still not ideal in comparison to native BSR.
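To illustrate the mitigation, here is a small sketch of how a minimum interval coalesces closely spaced triggers. The class and method names are hypothetical, and times are plain seconds supplied by the caller rather than read from a clock:

```python
class BsmPacer:
    """Coalesce BGP-triggered BSM attempts behind a minimum send interval."""

    def __init__(self, min_interval=10.0):
        self.min_interval = min_interval
        self.last_sent = None   # time the last BSM was sent, if any
        self.pending = False    # True if changes are waiting to be sent

    def on_bgp_change(self, now):
        """Record a change and report whether a BSM may be sent right now."""
        self.pending = True
        return self.try_send(now)

    def try_send(self, now):
        """Send if something is pending and the interval has elapsed."""
        if not self.pending:
            return False
        if self.last_sent is not None and now - self.last_sent < self.min_interval:
            return False
        self.last_sent = now
        self.pending = False
        return True
```

If five changes arrive at t = 0, 1, 2, 3, and 4 seconds, only the first triggers an immediate BSM. The other four stay pending and go out together in a single BSM once `try_send` is called at or after t = 10, which is the added convergence delay noted above.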
> End of RIB
Given that there is almost always a route reflector between the ingress
and egress PEs, how is the "End of RIB" marker going to be helpful in
deciding when to originate a BSM?
<Pavan+Kurt>Thanks for pointing this out, that is correct. Also, given that
End of RIB implementations may differ, we will remove it in the next
version of the draft.
> BS_Timeout
There seems to be a problem with the following procedure from section
6.2.2 ("Missing BSM") (…)
<Kurt>First off, I think this draft needs to clarify how it integrates with
the BSR state machine in the same terms as RFC 5059. The precise events and
actions on the PEs need to be specified so that it's clear how BGP events
on PEs interact with the local RP-Set and PE-CE BSMs.
Regarding the "Missing BSM" behavior, we need to specify the following,
which will revise the original text and procedures:
* On the ingress PE, a timer must be run at the interval of
BS_Period (+ some small value to allow for jitter) such that if a BSM
has not been received from the local BSR when it expires, the BSR
Parameters NLRI is withdrawn.
* On the egress PE, upon withdrawal of the BSR Parameters NLRI, the PE
must set its BS_Timeout to be BS_Timeout - BS_Period, and BSMs must not
be sent while the BSR Parameters NLRI is not present.
Thus, if timers are in sync and allowed jitter is small, behavior should
mirror what would happen in native BSR from the points of view of CEs
on the egress side.
However, since we cannot rely on timers being perfectly in sync, it is
still possible for the egress site to retain the RPs from the missing
BSM for up to one BS_Period.
This problem needs more thought to see if we can close this gap
in relation to the native BSR operations.
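The two rules above could be sketched roughly as follows. The timer values are example numbers, the jitter allowance is an arbitrary placeholder, and reading "set its BS_Timeout to BS_Timeout - BS_Period" as "start the timeout with that reduced value" is an assumption on my part; all names are hypothetical:

```python
BS_PERIOD = 60.0    # example value, seconds
BS_TIMEOUT = 130.0  # example value, seconds
JITTER = 2.0        # placeholder allowance, seconds


def ingress_must_withdraw(last_bsm_rx, now, bs_period=BS_PERIOD, jitter=JITTER):
    """True once no BSM has arrived from the local BSR for BS_Period + jitter."""
    return (now - last_bsm_rx) > (bs_period + jitter)


class EgressBsrState:
    """Egress PE view of one BSR Parameters NLRI."""

    def __init__(self, bs_period=BS_PERIOD, bs_timeout=BS_TIMEOUT):
        self.bs_period = bs_period
        self.bs_timeout = bs_timeout
        self.params_present = True
        self.timeout_deadline = None

    def on_params_withdrawn(self, now):
        # Start the timeout shortened by one BS_Period, since roughly that
        # much time has already passed at the ingress PE.
        self.params_present = False
        self.timeout_deadline = now + (self.bs_timeout - self.bs_period)

    def may_send_bsm(self):
        # No BSMs toward CEs while the BSR Parameters NLRI is absent.
        return self.params_present

    def timed_out(self, now):
        return self.timeout_deadline is not None and now >= self.timeout_deadline
```

With these example numbers, a withdrawal seen at t = 100 stops BSMs immediately and the shortened timeout fires at t = 170. The gap discussed above shows up as the difference between when the ingress timer actually fired and when the egress PE learned of the withdrawal.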
> With regard to the sentence "As soon as the Type-1 is withdrawn,
BS_Timeout period has to be started at the egress and upon its expiry, all
the Type-2 and Type-3 entries MUST be deleted", it doesn't seem right for
an egress PE to remove BGP installed routes based upon the expiry of a
local timer.
<Kurt>Yes, this should read something to the effect that when BS_Timeout
expires on the egress PE, the RPs/groups discovered from the elected BSR
should be removed from the local RP-Set. The BGP routes will be deleted
when the ingress PE withdraws them.
Thanks again for the valuable comments.
We hope to produce a revised draft in the coming weeks.
Thanks again
--Kurt
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Eric
Rosen
Sent: Monday, June 24, 2013 2:51 PM
To: Pavan Kurapati
Cc: [email protected]; [email protected]; [email protected]
Subject: Comments on draft-kurapati-dynamicrp-bgpmvpn-00.txt
Stig Venaas and I have discussed draft-kurapati-dynamicrp-bgpmvpn-00.text,
and together we have prepared the following set of questions and comments.
- We are wondering why the draft proposes to use a new SAFI, rather than
reusing the MCAST-VPN SAFI.
Using a new SAFI does provide a bit more freedom in designing the NLRI,
but the draft sticks to the basic NLRI format of the MCAST-VPN SAFI
anyway. I don't think the new SAFI would be used on a BGP session unless
the MCAST-VPN SAFI is also used on that session, so why not just use the
MCAST-VPN SAFI for the new route types?
- The draft seems to allow an AFI of "IPv6" to be used together with an IPv4
BSR address, or vice versa.
When using the type 1-7 routes of the MCAST-VPN SAFI, the AFI designates
the address family being used by the customer. The address family being
used by the service provider is inferred from the various length
computations that are discussed in RFC 6515. It seems best to stick with
that same convention for the new BSR route types. That would mean that
the field "BSR Address" must be of the address family identified by the
AFI. The address length would then have to be appropriate for that
address family, or the Update would be considered malformed.
On the other hand, if it were to be decided to use a new SAFI, it might
make more sense to dispense with the RFC 6515 hacks altogether and
explicitly encode the address family of each address.
- For RP addresses and Group addresses the draft proposes to use the
"encoded" formats from the PIM spec. These formats contain an octet that
identifies the address family. There should be a requirement that the
address family as encoded in the "encoded format" be the same as the
address family identified in the BGP Update's AFI, and that the lengths be
appropriate for that address family.
- In the BGP Update, parsing would be simpler if the Length field that
precedes an "encoded group format" field or an "encoded unicast address
field" contains the length of that field, not the length of the address
prefix that appears within the encoded format.
- The mention of the "VRF Route Import Extended Community" in section 4.1
should say "VRF Route Import Extended Community" or "VRF Route Import IPv6
Address Specific Extended Community", to cover the case of a SP with an
IPv6 infrastructure. (It also needs to be made clear that this applies to
the NLRI of sections 4.2 and 4.3 as well.)
- What action is to be taken if a BGP Update with an MCAST-VPN-BSR NLRI is
received, but there is no BSR-BGP Path attribute?
- It's hard to interpret phrases like "the group count for this NLRI is not
set". How does one send this attribute without "setting" all its fields?
Does "not set" just mean "set to zero", or does it mean only that certain
fields are irrelevant to the processing of certain NLRIs.
- The draft could use a little table to show which fields affect the
processing which received NLRIs:
NLRI                   FragTag   RP Count   Group Count
BSR Parameters         Yes       No         Yes
BSM Group Parameters   Yes       Yes        No
BSM RP Parameters      Yes       No         No
I think this table corresponds to your intentions.
Where a particular NLRI/field combination is "No", perhaps what the draft
should say is that the field MUST be ignored when processing that type of
NLRI. That would allow the ignored field to carry any value, without risk
of any interoperability problems. If one only says "SHOULD be ignored",
there may be interoperability problems.
- Regarding Fragmentation Tags
There don't seem to be any clear instructions as to when the fragmentation
tag field of the BSR-BGP attribute of a given NLRI actually needs to be
changed. As a result, it's difficult to figure out its uses. If some
customer is sending fragmented BSMs every minute, one doesn't want to have
BGP update all its RP mappings every minute. So just when does the
attribute value have to change? Hopefully not too often, or there will be
a lot of BGP thrashing.
It's difficult to understand why a fragmentation tag field is needed in the
BSR-BGP attribute at all. The Group Count and RP Count fields are really
what control when an egress PE can send a BSM. If an ingress PE doesn't
advertise changes to a groups RP mappings until it has all the mappings
for that group (which I think is required in BSR), why can't fragmentation
be entirely a local matter (i.e., not communicated across the net)? What
are we missing?
- Constructing BSMs from the Counts
Suppose an ingress PE receives a BSM with 15 RP mappings for a given
group. Then it receives another BSM with 15 RP mappings for that group,
10 of which are the same, and 5 of which are different.
It seems that if the egress PE receives "withdraw, update, withdraw,
update, withdraw, update, withdraw, update, withdraw, update", it could
generate five BSMs. Is our understanding correct, or are we missing
something?
- End of RIB
Given that there is almost always a route reflector between the ingress
and egress PEs, how is the "End of RIB" marker going to be helpful in
deciding when to originate a BSM?
- BS_Timeout
There seems to be a problem with the following procedure from section
6.2.2 ("Missing BSM"):
"Egress PE receiving a withdrawn "BSR Parameters" route (Type-1)
MUST still keep the corresponding Type-2 and Type-3 entries.
However, it MUST NOT advertise the BSM to the CE without the
Type-1 route present. As soon as the Type-1 is withdrawn,
BS_Timeout period has to be started at the egress and upon its
expiry, all the Type-2 and Type-3 entries MUST be deleted.
Say the egress has generated BSM at t=0. At t=1 BS_Period
expired at ingress PE and ingress PE did not get the periodic
BSM. So, it withdraws type-1 (BSR Parameters). Egress PE has
already generated BSM just before the type-1 withdrawal was
received. The egress PE skips the next periodic BSM towards the
CE. But CE is "off" by BS_Period interval by now. Once the
BS_Timeout expires, egress PE removes all the type-2 and type-3
entries. CEs connected to egress PE will remove the same, a
whole BS_Period later. Hence, to avoid this issue, once the
BS_Timeout expires,an egress PE MUST generate a new BSM towards
CE with RP hold time set to "0" for all the type-2 and type-3
entries. This will make the CEs in sinc with the the PEs. After
generating the BSM, PE removes all the Type-2 and Type-3 entries
as stated above.
The problem is the following. The holding times of the individual RP
mapping entries may be longer than the BS_Timeout. Typically if
BS_Timeout fires, the remaining holding time of an RP mapping entry will
be the difference between (a) its holding time as reported in the last
received BSM and (b) BS_Timeout. The above seems to set the RP holding
times to zero as soon as BS_Timeout expires. The problem with this is
that it may cause the RP mappings to timeout before a new BSR can be
elected.
Perhaps the withdrawal of a BSR parameters route should trigger the
transmission of a new BSM that doesn't set the RP-mapping holding times
to zero, but that just reduces each RP-mapping holding time by
BS_Timeout. Well, that would correct the RP-mapping holding times
downstream of an egress PE, but it would also have the side effect of
restarting the BS_Timeout at the routers downstream of the egress PE. So
that doesn't seem right either.
- With regard to the sentence "As soon as the Type-1 is withdrawn,
BS_Timeout period has to be started at the egress and upon its expiry, all
the Type-2 and Type-3 entries MUST be deleted", it doesn't seem right for
an egress PE to remove BGP installed routes based upon the expiry of a
local timer.