Stig Venaas and I have discussed draft-kurapati-dynamicrp-bgpmvpn-00.text,
and together we have prepared the following set of questions and comments.
- We are wondering why the draft proposes to use a new SAFI, rather than
reusing the MCAST-VPN SAFI.
Using a new SAFI does provide a bit more freedom in designing the NLRI,
but the draft sticks to the basic NLRI format of the MCAST-VPN SAFI
anyway. I don't think the new SAFI would be used on a BGP session unless
the MCAST-VPN SAFI is also used on that session, so why not just use the
MCAST-VPN SAFI for the new route types?
- The draft seems to allow an AFI of "IPv6" to be used together with an IPv4
BSR address, or vice versa.
When using the type 1-7 routes of the MCAST-VPN SAFI, the AFI designates
the address family being used by the customer. The address family being
used by the service provider is inferred from the various length
computations that are discussed in RFC 6515. It seems best to stick with
that same convention for the new BSR route types. That would mean that
the field "BSR Address" must be of the address family identified by the
AFI. The address length would then have to be appropriate for that
address family, or the Update would be considered malformed.
On the other hand, if it were to be decided to use a new SAFI, it might
make more sense to dispense with the RFC 6515 hacks altogether and
explicitly encode the address family of each address.
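
To make the suggested convention concrete, here is a minimal sketch of the check we have in mind. The function name and the treatment of unknown AFIs are our own assumptions for illustration and do not come from the draft; the AFI values 1 (IPv4) and 2 (IPv6) are the IANA Address Family Numbers.

```python
# Hypothetical helper, not taken from the draft.
AFI_IPV4 = 1  # IANA Address Family Number for IPv4
AFI_IPV6 = 2  # IANA Address Family Number for IPv6

# Address length in octets that each AFI implies for the BSR Address field.
EXPECTED_ADDR_LEN = {AFI_IPV4: 4, AFI_IPV6: 16}


def bsr_address_is_well_formed(afi: int, bsr_address: bytes) -> bool:
    """Return True if the BSR Address length matches the Update's AFI."""
    expected = EXPECTED_ADDR_LEN.get(afi)
    if expected is None:
        # Assumption: an AFI we do not know is treated as malformed here.
        return False
    return len(bsr_address) == expected
```

With such a check, an Update carrying AFI 2 and a 4-octet BSR address would be treated as malformed rather than as a mixed-family route.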
- For RP addresses and Group addresses the draft proposes to use the
"encoded" formats from the PIM spec. These formats contain an octet that
identifies the address family. There should be a requirement that the
address family as encoded in the "encoded format" be the same as the
address family identified in the BGP Update's AFI, and that the lengths be
appropriate for that address family.
- In the BGP Update, parsing would be simpler if the Length field that
precedes an "encoded group format" field or an "encoded unicast address
field" contains the length of that field, not the length of the address
prefix that appears within the encoded format.
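
The two points above (family consistency, and a Length that covers the whole encoded field) could be sketched as follows for the encoded unicast case. This is only an illustration: we assume a one-octet Length that counts octets, which may not match what the draft ends up specifying, and the function name is hypothetical. The layout Addr Family / Encoding Type / Address is that of the PIM encoded unicast format.

```python
from typing import Tuple

# Address length in octets per address family number (1 = IPv4, 2 = IPv6).
ADDR_LEN_BY_FAMILY = {1: 4, 2: 16}


def parse_encoded_unicast(buf: bytes, offset: int, afi: int) -> Tuple[bytes, int]:
    """Parse [Length][Addr Family][Encoding Type][Address] at offset.

    Assumption: Length is one octet and gives the size in octets of the
    whole encoded field that follows it.
    Returns (address, next_offset); raises ValueError if malformed.
    """
    if offset >= len(buf):
        raise ValueError("truncated: no Length octet")
    length = buf[offset]
    start = offset + 1
    end = start + length  # the parser can skip the field without decoding it
    if end > len(buf):
        raise ValueError("truncated: field runs past the end of the buffer")
    if length < 2:
        raise ValueError("field too short for family and encoding type")
    family = buf[start]
    address = buf[start + 2:end]
    if family != afi:
        raise ValueError("address family does not match the Update's AFI")
    if len(address) != ADDR_LEN_BY_FAMILY.get(family, -1):
        raise ValueError("address length wrong for the address family")
    return address, end
```

Because Length covers the entire encoded field, the next field always starts at the returned offset, whatever the address family turns out to be.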
- The mention of the "VRF Route Import Extended Community" in section 4.1
should say "VRF Route Import Extended Community" or "VRF Route Import IPv6
Address Specific Extended Community", to cover the case of a SP with an
IPv6 infrastructure. (It also needs to be made clear that this applies to
the NLRI of sections 4.2 and 4.3 as well.)
- What action is to be taken if a BGP Update with an MCAST-VPN-BSR NLRI is
received, but there is no BSR-BGP Path attribute?
- It's hard to interpret phrases like "the group count for this NLRI is not
set". How does one send this attribute without "setting" all its fields?
Does "not set" just mean "set to zero", or does it mean only that certain
fields are irrelevant to the processing of certain NLRIs?
- The draft could use a little table to show which fields affect the
processing of which received NLRIs:

    NLRI                    FragTag   RP Count   Group Count
    BSR Parameters          Yes       No         Yes
    BSM Group Parameters    Yes       Yes        No
    BSM RP Parameters       Yes       No         No

I think this table corresponds to your intentions.
Where a particular NLRI/field combination is "No", perhaps what the draft
should say is that the field MUST be ignored when processing that type of
NLRI. That would allow the ignored field to carry any value, without risk
of any interoperability problems. If one only says "SHOULD be ignored",
there may be interoperability problems.
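
As a sketch of what "MUST be ignored" would mean for a receiver, the relevance table can be expressed as data and applied before any further processing. The dictionary keys and field names below are our own shorthand, not names from the draft.

```python
# Which BSR-BGP attribute fields matter for each NLRI type, per the table
# in this message. Field names are hypothetical shorthand.
RELEVANT_FIELDS = {
    "BSR Parameters": {"frag_tag", "group_count"},
    "BSM Group Parameters": {"frag_tag", "rp_count"},
    "BSM RP Parameters": {"frag_tag"},
}


def fields_to_process(nlri_type: str, attribute: dict) -> dict:
    """Drop every attribute field that is not relevant to this NLRI type.

    Ignored fields may carry any value; they never reach the processing
    logic, so they cannot cause interoperability differences.
    """
    relevant = RELEVANT_FIELDS[nlri_type]
    return {name: value for name, value in attribute.items() if name in relevant}
```

For example, with an attribute of `{"frag_tag": 7, "rp_count": 99, "group_count": 3}`, a "BSM RP Parameters" route would be processed using only `frag_tag`, regardless of what the sender put in the other two fields.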
- Regarding Fragmentation Tags
There don't seem to be any clear instructions as to when the fragmentation
tag field of the BSR-BGP attribute of a given NLRI actually needs to be
changed. As a result, it's difficult to figure out its uses. If some
customer is sending fragmented BSMs every minute, one doesn't want to have
BGP update all its RP mappings every minute. So just when does the
attribute value have to change? Hopefully not too often, or there will be
a lot of BGP thrashing.
It's difficult to understand why a fragmentation tag field is needed in the
BSR-BGP attribute at all. The Group Count and RP Count fields are really
what control when an egress PE can send a BSM. If an ingress PE doesn't
advertise changes to a group's RP mappings until it has all the mappings
for that group (which I think is required in BSR), why can't fragmentation
be entirely a local matter (i.e., not communicated across the net)? What
are we missing?
- Constructing BSMs from the Counts
Suppose an ingress PE receives a BSM with 15 RP mappings for a given
group. Then it receives another BSM with 15 RP mappings for that group,
10 of which are the same, and 5 of which are different.
It seems that if the egress PE receives "withdraw, update, withdraw,
update, withdraw, update, withdraw, update, withdraw, update", it could
generate five BSMs. Is our understanding correct, or are we missing
something?
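
To spell out the scenario we are worried about, here is a small sketch. It assumes, purely for illustration, that the ingress PE interleaves one withdraw with one update per changed mapping, and that a naive egress PE originates a BSM every time it processes an update. Neither behavior is stated in the draft; the RP names are placeholders.

```python
def bgp_events(old_rps: set, new_rps: set) -> list:
    """Return the interleaved withdraw/update sequence for a changed RP set.

    Assumption: the number of removed and added mappings is equal, as in
    the example in this message.
    """
    withdrawn = sorted(old_rps - new_rps)
    added = sorted(new_rps - old_rps)
    events = []
    for gone, new in zip(withdrawn, added):
        events.append(("withdraw", gone))
        events.append(("update", new))
    return events


old = {f"rp{i}" for i in range(15)}      # rp0..rp14, from the first BSM
new = {f"rp{i}" for i in range(5, 20)}   # rp5..rp19: 10 shared, 5 different
events = bgp_events(old, new)

# A naive egress PE that reacts to each update originates this many BSMs.
bsms = sum(1 for kind, _ in events if kind == "update")
```

Under these assumptions the egress PE sees ten BGP events and originates five BSMs for what was a single change at the customer site.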
- End of RIB
Given that there is almost always a route reflector between the ingress
and egress PEs, how is the "End of RIB" marker going to be helpful in
deciding when to originate a BSM?
- BS_Timeout
There seems to be a problem with the following procedure from section
6.2.2 ("Missing BSM"):
"Egress PE receiving a withdrawn "BSR Parameters" route (Type-1)
MUST still keep the corresponding Type-2 and Type-3 entries.
However, it MUST NOT advertise the BSM to the CE without the
Type-1 route present. As soon as the Type-1 is withdrawn,
BS_Timeout period has to be started at the egress and upon its
expiry, all the Type-2 and Type-3 entries MUST be deleted.
Say the egress has generated BSM at t=0. At t=1 BS_Period
expired at ingress PE and ingress PE did not get the periodic
BSM. So, it withdraws type-1 (BSR Parameters). Egress PE has
already generated BSM just before the type-1 withdrawal was
received. The egress PE skips the next periodic BSM towards the
CE. But CE is "off" by BS_Period interval by now. Once the
BS_Timeout expires, egress PE removes all the type-2 and type-3
entries. CEs connected to egress PE will remove the same, a
whole BS_Period later. Hence, to avoid this issue, once the
BS_Timeout expires,an egress PE MUST generate a new BSM towards
CE with RP hold time set to "0" for all the type-2 and type-3
entries. This will make the CEs in sinc with the the PEs. After
generating the BSM, PE removes all the Type-2 and Type-3 entries
as stated above."
The problem is the following. The holding times of the individual RP
mapping entries may be longer than the BS_Timeout. Typically if
BS_Timeout fires, the remaining holding time of an RP mapping entry will
be the difference between (a) its holding time as reported in the last
received BSM and (b) BS_Timeout. The above seems to set the RP holding
times to zero as soon as BS_Timeout expires. The problem with this is
that it may cause the RP mappings to time out before a new BSR can be
elected.
Perhaps the withdrawal of a BSR parameters route should trigger the
transmission of a new BSM that doesn't set the RP-mapping holding times
to zero, but that just reduces each RP-mapping holding time by
BS_Timeout. Well, that would correct the RP-mapping holding times
downstream of an egress PE, but it would also have the side effect of
restarting the BS_Timeout at the routers downstream of the egress PE. So
that doesn't seem right either.
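
The arithmetic behind this concern can be shown with a short sketch. The two timer values are illustrative assumptions chosen only so that the holding time is longer than BS_Timeout; they are not taken from the draft.

```python
def remaining_holdtime(advertised_holdtime: int, bs_timeout: int) -> int:
    """Holding time left on an RP mapping when BS_Timeout fires.

    This is the difference between the holding time in the last received
    BSM and BS_Timeout, floored at zero.
    """
    return max(advertised_holdtime - bs_timeout, 0)


BS_TIMEOUT = 130  # seconds, assumed for illustration
HOLDTIME = 150    # seconds, assumed for illustration

remaining = remaining_holdtime(HOLDTIME, BS_TIMEOUT)
```

With these values the mapping would still be valid for 20 seconds after BS_Timeout fires. Setting the holding time to zero at that moment discards that interval, which is time a new BSR election might have needed.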
- With regard to the sentence "As soon as the Type-1 is withdrawn,
BS_Timeout period has to be started at the egress and upon its expiry, all
the Type-2 and Type-3 entries MUST be deleted", it doesn't seem right for
an egress PE to remove BGP-installed routes based upon the expiry of a
local timer.