Gunter Van de Velde has entered the following ballot position for draft-ietf-grow-bmp-bgp-rib-stats-14: Discuss
When responding, please keep the subject line intact and reply to all email addresses included in the To and CC lines. (Feel free to cut this introductory paragraph, however.) Please refer to https://www.ietf.org/about/groups/iesg/statements/handling-ballot-positions/ for more information about how to handle DISCUSS and COMMENT positions. The document, along with other ballot positions, can be found here: https://datatracker.ietf.org/doc/draft-ietf-grow-bmp-bgp-rib-stats/ ---------------------------------------------------------------------- DISCUSS: ---------------------------------------------------------------------- # Gunter Van de Velde, RTG AD, comments for draft-ietf-grow-bmp-bgp-rib-stats-14 # The line numbers used are rendered from IETF idnits tool: https://author-tools.ietf.org/api/idnits?url=https://www.ietf.org/archive/id/draft-ietf-grow-bmp-bgp-rib-stats-14.txt # Many thanks for the RTGDIR review from Bruno and the shepherd writeup from Job. # Did i miss seeing a cross posting to IDR/BESS to understand if the various suggested gauges definitions are accurately described and understood from protocol perspective. # DISCUSS # ======= #1# the section "5. Operational Considerations" seems to document a mix of operational considerations (non BCP14 language required) and GMP protocol formal procedures (BCP14 language is required). Can these two be untangled. It will make it easier for implementors to do the correct implementation. #2# In general i found the descriptions of most of the gauges for the newly proposed statistics types not very accurately described. See my ""COMMENT"" section for input and overview. Too lengthy in the overview DISCUSS section #3# some gauges seem duplicates from prior existing gauges. Not sure we need two times the same gauge in different code-points. seems sub-optimal and error prone. #4# section 5 is named "Statistics Definition" and that seems not aa well described title. Can this be something that better describes the content? for example "RIB monitoring type statistics" #5# it was unclear to me that what the document specifies is that the gauge that is formalized in this document is not simply a single dimensional gauge alone, but that the value transferred by BMP is a combination of "AFI + SAFI + gauge". I think i missed seeing that explicitly mentioned in the document. Adding lengths (in general introduction section maybe to avoid repetition) of each field would help making sure implementations interop well. ---------------------------------------------------------------------- COMMENT: ---------------------------------------------------------------------- # comments # ======== 19 This document defines new statistics type to monitor BMP Adj-RIB-In 20 and Adj-RIB-Out Routing Information Bases (RIBs). GV> in the abstract is mentioned that the document defines new statistics (but later is mentioned it are guages) 86 This document defines new gauges for BMP statistics message. GV> The above does not fully align with what is written in the abstract, I suspect you want to say: " This document defines gauges for new BMP statistics messages. " 107 * Pre-policy Adj-RIB-In: The result before applying the inbound 108 policy to an Adj-RIB-In. Note that this aligns with the pre- 109 policy Adj-RIB-In concept specified in Section 2 of [RFC7854]. GV> Why is the text from RFC7854 not re-used? is there need for a new explicit definition? GV> RFC7854 says: " o Adj-RIB-In: As defined in [RFC4271], "The Adj-RIBs-In contains unprocessed routing information that has been advertised to the local BGP speaker by its peers." This is also referred to as the pre-policy Adj-RIB-In in this document. " 127 * Primary route: A route to a prefix that is considered the best 128 route by the BGP decision process [RFC4271] and actively used for 129 forwarding traffic to that prefix. GV> is this accurate? is it not the BGP route that is selected by BGP for being forwarded to its peers? There may be ECMP or uECMP routes actively used 131 * Backup route: A backup route is eligible for route selection, but 132 it is not selected as the primary route and is also installed in 133 the Loc-RIB. It is not used until all primary routes become 134 unreachable. Backup routes are used for fast convergence in the 135 event of failures. GV> here is the concept of "all primary routes" used, indicating more as a single best route. Is this not contradicting the prior bullet point? 137 3. Statistics Definition GV> This title seems rather undescriptive. What about calling this section: " RIB monitoring type statistics " 145 * Type = 18: (64-bit Gauge) Current number of routes in pre-policy 146 Adj-RIB-In. This gauge is similar to stats type 7 defined in 147 [RFC7854] and makes it explicitly for the pre-policy Adj-RIB-In. GV> It is written that this is similar as stats type 7, but when looking at the definitions in section 2 it is exactly the same. pre-existing stats type 7 is exactly the same as the proposed stats type 18. Do we need type 18? 149 * Type = 19: (64-bit Gauge) Current number of routes in per-Address 150 Family Identifier (AFI)/Subsequent Address Family Identifier 151 (SAFI) pre-policy Adj-RIB-In. This gauge is similar to stats type 152 9 defined in Section 4.8 of [RFC7854] and makes it explicitly for 153 the pre-policy Adj-RIB-In. The value is structured as: 2-byte 154 AFI, 1-byte SAFI, followed by a 64-bit Gauge. GV> same observation as the prior item. The newly suggested type 19 is exactly the same as type 9. Do we need this new gauge? GV> what exactly is the "value"? Can the structure of the field be more clarified? how how is the field encoded? it seems more as a single dimensional 64 bit gauge. GV> first time usage of the AFI/SAFI in this document and adding a reference can be handy. Also maybe a list of AFI/SAFI this is intended for if this is only for a subset of them. 159 * Type = 21: (64-bit Gauge) Current number of routes in per-AFI/SAFI 160 post-policy Adj-RIB-In. The value is structured as: 2-byte AFI, 161 1-byte SAFI, followed by a 64-bit Gauge. GV> what exactly is the "value"? Can the structure of the field be more clarified? how how is the field encoded? it seems more as a single dimensional 64 bit gauge. 163 * Type = 22: (64-bit Gauge) Current number of routes in per-AFI/SAFI 164 rejected by inbound policy. This gauge is different from stats 165 type 0 defined in Section 4.8 of [RFC7854]. The stats type 0 is a 166 32-counter which is a monotonically increasing number and doesn't 167 represent the current number of routes rejected by an inbound 168 policy due to ongoing configuration changes. The value is 169 structured as: 2-byte AFI, 1-byte SAFI, followed by a 64-bit 170 Gauge. GV> If over time more and more routes are rejected, then how can the number of rejected routes go ever go down? its an increasing number. Unless there is assumption that there is accounting for the changing number of routes received/withdrawn by a peer and it is the number of routes that were rejected from the number of routes received. This may need more accurate definition of what exactly is being measured and what reference is used. 172 * Type = 23: (64-bit Gauge) Current Number of routes in per-AFI/SAFI 173 accepted by inbound policy. The value is structured as: 2-byte 174 AFI, 1-byte SAFI, followed by a 64-bit Gauge. Some 175 implementations, or configurations in implementations, may discard 176 routes that do not match policy and thus the accepted count (type 177 23) and the Adj-RIB-In counts (type 21) will be identical in such 178 cases. GV> not sure what is the text starting with "Some implementations, or ..." helps with the formal definition of the field. It is useful from operational perspective, but it convolutes the formal part of the definition of the field itself. Maybe move to operational implication section 180 * Type = 24: (64-bit Gauge) Current Number of routes in per-AFI/SAFI 181 selected as primary route. The value is structured as: 2-byte 182 AFI, 1-byte SAFI, followed by a 64-bit Gauge. GV> the primary route is the route forwarding traffic? does this include all ECMP and uECMP paths. BGP will only fwd the best BGP Path, but it may use more as a single path for forwarding 184 * Type = 25: (64-bit Gauge) Current Number of routes in per-AFI/SAFI 185 selected as a backup route. The value is structured as: 2-byte 186 AFI, 1-byte SAFI, followed by a 64-bit Gauge. GV> does this include all routes that are not the BGP best path or only the routes that are not used for forwarding? What makes a route a "backup" route. 195 * Type = 27: (64-bit Gauge) Current Number of routes in per-AFI/SAFI 196 marked as stale by Graceful Restart (GR) events. The value is 197 structured as: 2-byte AFI, 1-byte SAFI, followed by a 64-bit 198 Gauge. 'Stale' refers to a path which has been declared stale by 199 the BGP GR mechanism as described in Section 4.1 of [RFC4724]. GV> GR events happen when a CPM moves from a primary unit to a standby unit/process. Such involves significant processing. Hence i wonder how mush operational value this brings, or if would make the GR event worse then it already is. 201 * Type = 28: (64-bit Gauge) Current Number of routes in per-AFI/SAFI 202 marked as stale by Long-Lived Graceful Restart (LLGR). The value 203 is structured as: 2-byte AFI, 1-byte SAFI, followed by a 64-bit 204 Gauge. 'Stale' refers to a path which has been declared stale by 205 the BGP LLGR mechanism as described in Section 4.3 of [RFC9494]. GV> see prior comments 211 * Type = 30: (64-bit Gauge) Current Number of routes per-AFI/SAFI 212 left until reaching the received route threshold which corresponds 213 to the upper bound of accepted routes per Section 6.7 of 214 [RFC4271]. The value is structured as: 2-byte AFI, 1-byte SAFI, 215 followed by a 64-bit Gauge. GV> Is this accurate? multiprotocol extensions are described in RFC4760 and not in RFC4271. It is unclear how this counter referencing rfc4271 is to be applied to rfc4760 when multiple afi/safi may be received from a single peer. 217 * Type = 31: (64-bit Gauge) Current Number of routes left until 218 reaching a license-customized route threshold. This value is 219 affected by whether a customized license exists, and when the 220 customized license is installed. GV> This may be a soft threshold and in addition may be enforced outside the router knowledge. 222 * Type = 32: (64-bit Gauge) Current Number of routes in per-AFI/SAFI 223 left until reaching a license-customized route threshold. This 224 value is affected by whether a customized license exists for the 225 relevant address family, and when the customized license is 226 installed. The value is structured as: 2-byte AFI, 1-byte SAFI, 227 followed by a 64-bit Gauge. GV> This may be a soft threshold and in addition may be enforced outside the router knowledge. 264 * Type = 39: (64-bit Gauge) Current number of routes refused to be 265 sent by exceeding the maximum AS_PATH length supported by the 266 local configuration. GV> can this be more accurate described? Is it "refused to be sent" or simply "not sent" because route AS_PATH is longer as max AS_PATH length towards the peer? 268 * Type = 40: (64-bit Gauge) Current number of routes in per-AFI/SAFI 269 refused to be sent by exceeding the maximum AS_PATH length 270 supported by the local configuration. The value is structured as: 271 2-byte AFI, 1-byte SAFI, followed by a 64-bit Gauge. GV> See prior comment. I do not think 'refused' is the most accurate word to use.... maybe filtered is a better term to use? 328 5. Operational Considerations GV> Some of the definitions earlier have operational concerns included and are maybe better added to the operational implication section. 330 This document defines new gauges for BMP statistics messages. The GV> i think more accurate would be that the document specifies gauges for "new BMP statistics". 333 implementation-dependent. Implementations SHOULD determine 334 appropriate report generation and delivery strategies, including 335 configurable timing intervals and threshold values. The mechanism 336 for controlling the reporting of new gauges SHOULD be consistent with 337 that of existing types. Implementations SHOULD also support per- 338 router configuration of statistic subsets for collection and 339 reporting. GV> Why is this uppercase SHOULD? Is there a procedure that breaks? lowercase seems sufficient as its documenting good behavior. 341 Some statistics are dependent on feature configurations, such as GR, 342 LLGR, and RPKI, so the corresponding statistics are only sent when 343 these features are enabled. This statistics include Type 24, 25, 26, GV> From operational perspective sending BGP Stats during a GR may impact the GR event due to additional processing and dynamics. That is an operational concern. 351 Certain statistics may have logical relationships (e.g., per-AFI/SAFI 352 counts summing to global totals). Implementations MAY perform 353 consistency checks but MUST NOT assume strict dependencies (due to 354 potential race conditions or partial failures). Discrepancies (e.g., 355 sum(per-AFI/SAFI) != global count) SHOULD be logged as warnings but 356 MUST NOT disrupt protocol operation. GV> not convinced these need to be BCP14 language. is BCP14 language required? 358 For backward compatibility, and absent policy otherwise, it is 359 RECOMMENDED that monitored routers capable of generating both (Type 7 360 and Type 18) or (Type 9 and Type 19) BMP statistics SHOULD transmit 361 both corresponding types simultaneously. This allows monitoring 362 stations to process either format according to their needs without 363 disrupting existing implementations that rely on Type 7 or Type 9. GV> In what way are The new types different from the prior types. its the exact same value representing the exact same property. 369 Counters may reset due to session restart, manual clearance, or 370 overflow. Implementations MUST track discontinuities and log this 371 information. GV> This document specifies gauges, not counters. Is this accurate usage of words? is BCP14 language correct? seems not to be about formal protocol procedure 373 Operators MAY consider rate-limiting statistic updates to minimize 374 performance impact on control-plane processes. Operators SHOULD 375 enable only necessary statistics to reduce memory and CPU overhead. GV> lowercase should/may seems sufficient Many thanks for this document, Kind Regards, Gunter Van de Velde RTG Area Director _______________________________________________ GROW mailing list -- [email protected] To unsubscribe send an email to [email protected]
