Hi Jeff,

Sorry for my late response. Please see my replies inline, marked with [Tiger].

From: Jeffrey Haas <[email protected]>
Date: Wednesday, May 6, 2026 22:29
To: Tiger Xu <[email protected]>
Cc: BESS <[email protected]>, [email protected] <[email protected]>
Subject: Some comments on carrying bandwidth in BGP, and also on draft-xu-idr-fare-04 (was Re: [Idr] Working Group Last Call on draft-ietf-bess-ebgp-dmz)

[Speaking as an individual contributor in this response]

> On May 6, 2026, at 03:17, Tiger Xu <[email protected]> wrote:
> In essence, this changes the extended community from non-transitive to
> transitive and introduces the concept of bandwidth aggregation – both of
> which were already present in draft-xu-idr-fare version -00.

First, a few words on the general use cases of carrying bandwidth/capacity in BGP routes:

The link-bandwidth feature, its varying uses over the years, and the varying transitivities used for it[1] have been a long-running mess. At a fundamental level, the feature of "we've sent a value, apply a multipath ratio across all paths for that destination based on the received values" has been broadly consistent across the implementations. As the use cases started to split between underlay and overlay topology, how multipath was handled at each layer and how the layers interacted made load balancing messier.

One could observe that there are benefits to splitting the feature that carries the bandwidth/capacity signaling based on the role the routes are intended to serve, and also on where they are applied. The fact that the "role" of a given route is generally clear in most BGP contexts where BGP is carrying the underlay routing has made it less of a deployment problem to use the same signaling mechanism for both underlay and overlay. However, it does mean that in places where "math" on those values has been necessary, having an overloaded signaling mechanism complicates implementation and operational logic.

The ready example covered in many places is that having hop-by-hop underlay bandwidth capacity is great for load balancing across nexthops.
However, when it comes time to consider multipath load balancing for individual overlay/service routes passing over links of disparate capacities, there tends to be a need to apply math based on the desired network-wide load balancing. Is it that you want a receiver to acquire the minimal functional bandwidth that path can use? Or is it a ratio for traffic to be broadly load balanced behind a set of paths? And certainly there are more use cases.

The various use cases have been broadly solved on a single signaling mechanism and - frustrating to some - by operational paradigm and discipline. Simply having more than one signaling mechanism would offer some flexibility to operators and implementors. This has been mentioned in multiple contexts over the years.

[Tiger] I fully agree with your opinion.

There has also been appropriate criticism of the link-bw encoding. The choice of IEEE 754 32-bit floating point numbers provided a useful way to carry big numbers across BGP in an existing encoding - extended communities. However, the poor granularity of that type for the numbers we use these days in networks leads to mostly operational issues. For example, you can configure one number and the closest rounded number is what is encoded on the wire. Similarly, how do you do policy on numbers where rounding may be in place? And finally, such numbers don't encode or interact nicely with YANG.

There have been some proposals to simply change the encoding to get us out of this particular bit of unpleasantness. I think there is room for further work to provide for a less insane encoding. However, that will also lead us to figuring out how a new such mechanism (possibly a new community) interops with the existing stuff.

Since most of the use cases for link-bw are satisfied with being a ratio rather than carrying precise numbers, the pressure to address the deficiencies above hasn't been high.
However, once there's a desire for more precise capacity encoding, we'll likely see the appropriate mechanisms being proposed for those use cases - and those use cases may overlap with the existing ones.

I think there's more room for work to provide cleaner separation of overlay and underlay use cases. In this respect, I'm supportive of continuing discussion on the work you've begun with FARE. But like the other comments above, much of that discussion will be whether a separate signaling mechanism makes our lives easier at the implementation and at the operations level. I look forward to that discussion.

[Tiger] Although the current version of the link-bandwidth draft has eliminated several of the limitations of the link-bandwidth extended community mentioned in Section 1.1 of the FARE draft, its requirement to use both transitive and non-transitive link-bandwidth extended communities is not suitable for FARE, especially in a 5-stage CLOS environment (see Section 4.2 of the FARE draft for more details). The link-bandwidth draft says:

"Generally, a single Link Bandwidth Extended Community of the transitivity type desired in a deployment is attached to a route. However during transition (refer Section 7<https://datatracker.ietf.org/doc/html/draft-ietf-idr-link-bandwidth-22#Operational_Condiderations> for details), a BGP speaker MAY attach one Link Bandwidth Extended Community per transitivity (transitive/non-transitive); the bandwidth value field in both communities SHOULD be the same."

[Tiger] In short, the link-bandwidth draft uses both transitive and non-transitive types for the link-bandwidth attribute to enable interoperation between old and new implementations, whereas the FARE draft uses both transitive and non-transitive types for the path-bandwidth attribute to distinguish two different kinds of path bandwidth values (see Section 4.2 of the FARE draft for more details).
Defining two distinct bandwidth-specific attributes (one for DMZ external link bandwidth and another for path bandwidth) would simplify matters, unless sub-type values are extremely scarce, IMHO.

A few terse technical comments on the draft itself:

Section 3: Your requested encoding is impossible in RFC 4360 extended communities. You have six octets to work with. You have both global and local-admin fields that require 4 octets each.

[Tiger] In our current implementation, the 2-byte local-admin field of the IPv4-address-specific extended community is filled with the path bandwidth value in units of GB/s, using the IEEE 16-bit half-precision floating-point format. However, your comment above makes me reconsider using the ASN-specific extended community, which has a 4-byte field that can convey the path bandwidth value in the IEEE 32-bit single-precision floating-point format.

Security/Operational considerations: Your desire in this draft is to use transitive extended communities. Unlike the hop-by-hop (re-)generated non-transitive extended communities used by DMZ, you have attribute escape issues to address:

- If a given node doesn't "do math" on the community because it doesn't understand it, how does that impact the use case?
- You need to protect the deployment against receiving such communities from outside the deployment.
- You need to discuss how you remove the communities when the routes are being sent outside the deployment.

[Tiger] The path bandwidth attribute is targeted at AI back-end data center network scenarios, and therefore there is no such risk. Anyway, I will add some text to explain this consideration.

Best regards,
Tiger

Some of these considerations are already addressed as part of the link-bw document.

-- Jeff

[1] Juniper issued the first version as non-transitive, and then immediately started shipping code where it was transitive while squatting on the transitive code point - sloppiness from my forebears that has made for unfortunate cleanup work in IETF along with interop issues.
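
The rounding concern raised in the thread (a configured number is replaced on the wire by the closest representable value) can be illustrated with a small sketch. It only round-trips numbers through the IEEE 754 single-precision and half-precision formats using Python's standard `struct` module; it does not build a real extended community. The helper names are hypothetical, and the choice of bytes per second for the 32-bit example is an assumption made for illustration. The GB/s unit for the 16-bit example follows Tiger's description of the current implementation.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Encode value in an IEEE 754 struct format and decode it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def as_float32(value: float) -> float:
    """Value as it would survive a 4-octet single-precision field."""
    return round_trip(value, "!f")


def as_float16(value: float) -> float:
    """Value as it would survive a 2-octet half-precision field."""
    return round_trip(value, "!e")


if __name__ == "__main__":
    # Assumed unit: bytes per second. 400 Gb/s is 50e9 bytes per second.
    configured = 50e9
    print(as_float32(configured))          # 49999998976.0, not 50000000000.0

    # A slightly different configured value lands on the same wire value,
    # which is what makes exact-match policy on these numbers awkward.
    print(as_float32(configured + 1000) == as_float32(configured))  # True

    # Half precision, value in GB/s: the step size is already 4 above 4096.
    print(as_float16(4101))                # 4100.0
```

At this magnitude the single-precision step is 4096, so any two configured values that fall within the same step are indistinguishable once encoded. This is consistent with the observation in the thread that the encoding works well enough when the value is only used as a ratio, and less well when precise capacity matters.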
