Hi Felix,

Felix Huettner, Oct 04, 2023 at 09:24:
> Hi Robin,
>
> I'll try to answer what I can.
>
> On Tue, Oct 03, 2023 at 09:22:53AM +0200, Robin Jarry via discuss wrote:
> > Hi all,
> >
> > Felix Huettner, Oct 02, 2023 at 09:35:
> > > Hi everyone,
> > >
> > > just want to add my experience below.
> > >
> > > On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > > > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry <rja...@redhat.com> wrote:
> > > > >
> > > > > Hi Han,
> > > > >
> > > > > Please see my comments/questions inline.
> > > > >
> > > > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > > > Distributed MAC learning
> > > > > > > ========================
> > > > > > >
> > > > > > > Use one OVS bridge per logical switch with MAC learning
> > > > > > > enabled. Only create the bridge if the logical switch has
> > > > > > > a port bound to the local chassis.
> > > > > > >
> > > > > > > Pros:
> > > > > > >
> > > > > > > - Minimal OpenFlow rules required in each bridge (ACLs and
> > > > > > >   NAT mostly).
> > > > > > > - No central mac binding table required.
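
To make this concrete, the per-logical-switch setup I had in mind would
be something along these lines, relying on the OVS built-in FDB (bridge
and port names are invented for the example):

  # one bridge per logical switch, created only when a port of that
  # switch is bound to the local chassis
  ovs-vsctl add-br ls-net1
  # let OVS handle MAC learning/aging, with a configurable aging time
  ovs-vsctl set bridge ls-net1 other-config:mac-aging-time=300
  # plug local VIFs directly, plus a patch port pair towards br-int
  # for the parts that still need OpenFlow rules (ACLs, NAT, ...)
  ovs-vsctl add-port ls-net1 vif1
  ovs-vsctl add-port ls-net1 patch-net1-int \
      -- set interface patch-net1-int type=patch options:peer=patch-int-net1
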
> > > > > > Firstly, to clarify the terminology of "mac binding" to
> > > > > > avoid confusion, the mac_binding table currently in SB DB has
> > > > > > nothing to do with L2 MAC learning. It is actually the
> > > > > > ARP/Neighbor table of distributed logical routers. We should
> > > > > > probably call it IP_MAC_binding table, or just Neighbor table.
> > > > >
> > > > > Yes, sorry about the confusion. I actually meant the FDB table.
> > > > >
> > > > > > Here what you mean is actually L2 MAC learning, which today
> > > > > > is implemented by the FDB table in SB DB, and it is only for
> > > > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > > > a MAC address of a VIF.
> > > > >
> > > > > This is not that uncommon in telco use cases where VNFs can
> > > > > send packets from mac addresses unknown to OVN.
> > > >
> > > > Understand, but VNFs contribute a very small portion of the
> > > > workloads, right? Maybe I should rephrase that: it is uncommon to
> > > > have "unknown" addresses for the majority of ports in a large
> > > > scale cloud. Is this understanding correct?
> > >
> > > I can only share numbers for our use case: with ~650 chassis, we
> > > have the following distribution of "unknown" in the `addresses`
> > > field of Logical_Switch_Port:
> > >
> > > * 23000 with a mac address + ip and without "unknown"
> > > * 250 with a mac address + ip and with "unknown"
> > > * 30 with just "unknown"
> > >
> > > The use case is a generic public cloud and we do not have any
> > > telco related things.
> >
> > I don't have any numbers from telco deployments at hand but I will
> > poke around.
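
(For readers following the thread: these three categories correspond to
how the LSP addresses are declared in the NB DB. The middle one would be
set with something like the following, MAC and IP made up:)

  # known MAC + IP, but the port may also send from other addresses
  ovn-nbctl lsp-set-addresses lsp1 "fa:16:3e:aa:bb:cc 10.0.0.5" unknown
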
> > > > > > The purpose of this proposal is clear - to avoid using
> > > > > > a central table in DB for L2 information but instead using
> > > > > > L2 MAC learning to populate such information on chassis,
> > > > > > which is a reasonable alternative with pros and cons.
> > > > > > However, I don't think it is necessary to use separate OVS
> > > > > > bridges for this purpose. L2 MAC learning can be easily
> > > > > > implemented in the br-int bridge with OVS flows, which is
> > > > > > much simpler than managing a dynamic number of OVS bridges
> > > > > > just for the purpose of using the built-in OVS mac-learning.
> > > > >
> > > > > I agree that this could also be implemented with VLAN tags on
> > > > > the appropriate ports. But since OVS does not support trunk
> > > > > ports, it may require complicated OF pipelines. My intent with
> > > > > this idea was twofold:
> > > > >
> > > > > 1) Avoid a central point of failure for mac learning/aging.
> > > > > 2) Simplify the OF pipeline by making all FDB operations
> > > > >    dynamic.
> > > >
> > > > IMHO, the L2 pipeline is not really complex. It is probably the
> > > > simplest part (compared with other features for L3, NAT, ACL,
> > > > LB, etc.). Adding dynamic learning to this part probably makes
> > > > it *a little* more complex, but should still be straightforward.
> > > > We don't need any VLAN tag because the incoming packet has the
> > > > geneve VNI in the metadata. We just need a flow that resubmits
> > > > to look up a MAC-tunnelSrc mapping table, and injects a new flow
> > > > (with the related tunnel endpoint information) if the src MAC is
> > > > not found, with the help of the "learn" action. The entries are
> > > > per-logical_switch (VNI). This would serve your purpose of
> > > > avoiding a central DB for L2. At least this looks much simpler
> > > > to me than managing a dynamic number of OVS bridges and the
> > > > patch pairs between them.
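
If I got the idea right, it would look roughly like the sketch below
(untested; table numbers, the timeout and the tunnel port name are
invented):

  # Packet arriving over a tunnel: remember that its source MAC sits
  # behind the sending tunnel endpoint, keyed by VNI (tun_id), then
  # continue with local delivery (not shown).
  table=0, priority=100, in_port=tun0,
    actions=learn(table=10, hard_timeout=300,
                  NXM_NX_TUN_ID[],
                  NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],
                  load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[]),
            resubmit(,20)
  # Packet from a local VIF (assuming tun_id was set earlier from the
  # VIF's logical switch): the learned flow, if any, sets the tunnel
  # destination before output. A miss leaves tun_dst at 0 and would
  # need a fallback flood to all endpoints of that logical switch.
  table=0, priority=0, actions=resubmit(,10), output:tun0
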
> > Would that work for non-GENEVE networks (localnet) when there is no
> > VNI? Does that apply as well?
> >
> > > > > > Now back to the distributed MAC learning idea itself.
> > > > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > > > VM1@Chassis1 needs to send a packet to VM2@chassis2, assuming
> > > > > > VM1 already has VM2's MAC address (we will discuss this
> > > > > > later), Chassis1 needs to know that VM2's MAC is located on
> > > > > > Chassis2.
> > > > > >
> > > > > > In OVN today this information is conveyed through:
> > > > > >
> > > > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > > > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > > > > >
> > > > > > In your proposal:
> > > > > >
> > > > > > - MAC and Chassis mapping (can be learned through initial L2
> > > > > >   broadcast/flood)
> > > > > >
> > > > > > This indeed would avoid the control plane cost through the
> > > > > > centralized components (for this L2 binding part). Given that
> > > > > > today's SB OVSDB is a bottleneck, this idea may sound
> > > > > > attractive. But please also take into consideration the below
> > > > > > improvements that could mitigate the OVN central scale issue:
> > > > > >
> > > > > > - For MAC and LSP mapping, northd is now capable of
> > > > > >   incrementally processing VIF related L2/L3 changes, so the
> > > > > >   cost of NB -> northd -> SB is very small. For SB ->
> > > > > >   Chassis, a more scalable DB deployment, such as the OVSDB
> > > > > >   relays, may largely help.
> > > > >
> > > > > But using relays will only help with read-only operations
> > > > > (SB -> chassis). Write operations (from dynamically learned mac
> > > > > addresses) will be equivalent.
> > > >
> > > > OVSDB relay supports write operations, too. It scales better
> > > > because each ovsdb-server process handles a smaller number of
> > > > clients/connections. It may still perform worse when there are
> > > > too many write operations from many clients, but I think it
> > > > should scale better than without a relay. This is only based on
> > > > my knowledge of the ovsdb-server relay, but I haven't tested it
> > > > at scale, yet. People who actually deployed it may comment more.
> > >
> > > From our experience I would agree with that. I think removing the
> > > large amount of updates that need to be sent out after a write is
> > > the most helpful thing here. If you are, however, limited by raw
> > > write throughput and already hit that without any readers, then
> > > I guess the only options are to make this decentralized or to
> > > improve OVSDB.
> >
> > OK thanks.
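
As a side note for the archives: if I read the OVSDB documentation
correctly, a southbound relay deployment would look something like this
(addresses and ports invented):

  # on a relay node: serve a replica of the SB DB, forwarding writes
  # to the actual cluster
  ovsdb-server --remote=ptcp:6642 relay:OVN_Southbound:tcp:10.0.0.1:6642
  # on each chassis: point ovn-controller at the relay instead of the
  # central cluster
  ovs-vsctl set open_vswitch . external_ids:ovn-remote=tcp:10.0.0.2:6642
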
> > > > > > - For LSP and Chassis mapping, the round trip through
> > > > > >   a central DB obviously costs more than a direct L2
> > > > > >   broadcast (the targets are the same). But this can be
> > > > > >   optimized if the MAC and Chassis are known by the CMS
> > > > > >   (which is true for most OpenStack/k8s environments,
> > > > > >   I believe). Instead of updating the binding from each
> > > > > >   Chassis, the CMS can tell this information through the same
> > > > > >   NB -> northd -> SB -> Chassis path, and the Chassis can
> > > > > >   just read the SB without updating it.
> > > > >
> > > > > This is only applicable for known L2 addresses. Maybe telco
> > > > > use cases are very specific, but being able to have ports that
> > > > > send packets from unknown addresses is a strong requirement.
> > > >
> > > > Understand. But in terms of scale, the assumption is that the
> > > > majority of ports' addresses are known by the CMS. Is this
> > > > assumption correct in telco use cases?
> >
> > I will reach out to our field teams to try and get an answer to this
> > question.
> >
> > > > > > On the other hand, the dynamic MAC learning approach has its
> > > > > > own drawbacks.
> > > > > >
> > > > > > - It is simple to consider L2 only, but if considering more
> > > > > >   SDN features, a central DB is more flexible to extend and
> > > > > >   implement new features than a network protocol based
> > > > > >   approach.
> > > > > > - It is more predictable and easier to debug with information
> > > > > >   pre-populated through the CMS than with states learned
> > > > > >   dynamically in the data plane.
> > > > > > - With the DB approach we can suppress most L2
> > > > > >   broadcast/flood, while with distributed MAC learning
> > > > > >   broadcast/flood can't be avoided. Although it may happen
> > > > > >   mostly when a new workload is launched, it can also happen
> > > > > >   on aging. The cost of broadcast in a large L2 domain is
> > > > > >   also a potential threat to scale.
> > > > >
> > > > > I may lack the field experience of operating large datacenter
> > > > > networks but I was not aware of any scaling issues because of
> > > > > ARP and/or other L2 broadcasts. Is this an actual problem that
> > > > > was reported by cloud/telco operators and which influenced the
> > > > > centralized design decisions?
> > > >
> > > > I didn't hear any cloud operator reporting such a problem, but
> > > > I did hear people express their concerns about it in many
> > > > situations. And if you google "ARP suppression" there are lots
> > > > of implementations by different vendors. And I believe it is
> > > > a real problem if not well managed, e.g. using an extremely
> > > > large L2 domain without ARP suppression. But I also believe it
> > > > shouldn't be a big concern if L2 segments are small.
> >
> > I was not aware that it could be a real issue at scale. I was
> > expecting ARP traffic to be completely negligible.
> >
> > > I think for large L2 domains having ARP requests broadcast is
> > > purely a question of network bandwidth. If they get dropped before
> > > they reach the destination they can just be retried, and in normal
> > > communication the ARP cache will not expire anyway.
> > >
> > > However this is different if you have some kind of L2 based
> > > failover, e.g. if you move a mac address between hosts and then
> > > send out GARPs. In this case you must be certain that these GARPs
> > > have reached every single switch for it to update its FIB. Without
> > > a central state this is impossible to guarantee since packets
> > > might be randomly dropped. With a central state you need just one
> > > node to pick up the information and write it to some central store
> > > (just like that works for virtual Port_Bindings iirc).
> > >
> > > Note that this would be just a benefit of the central store, which
> > > also has drawbacks that you already mentioned.
> >
> > Are you thinking about MLAG?
>
> No, the central store is what we currently have with e.g. the
> Port_Binding table in the southbound db (at least from my
> understanding).

I was referring to the "L2 based failover" example you mentioned. Or
maybe you were talking about VRRP?

> > > > > > > Use multicast for overlay networks
> > > > > > > ==================================
> > > > > > >
> > > > > > > Use a unique 24-bit VNI per overlay network. Derive
> > > > > > > a multicast group address from that VNI. Use VXLAN address
> > > > > > > learning [2] to remove the need for ovn-controller to know
> > > > > > > the destination chassis for every mac address in advance.
> > > > > > >
> > > > > > > [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
> > > > > > >
> > > > > > > Pros:
> > > > > > >
> > > > > > > - Nodes do not need to know about others in advance. The
> > > > > > >   control plane load is distributed across the cluster.
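
To make the derivation part concrete: what I had in mind was simply
embedding the 24-bit VNI into the 239.0.0.0/8 organization-local scope
and letting the kernel VTEP handle the flooding and learning, e.g.:

  # VNI 4097 = 0x001001 -> group 239.0.16.1 (239.<hi>.<mid>.<lo>)
  ip link add vxlan4097 type vxlan id 4097 group 239.0.16.1 \
      dstport 4789 dev eth0
  # unknown/broadcast destinations are flooded to the group; source
  # MAC to VTEP mappings are learned from received packets as per
  # RFC 7348 section 4.2

The exact group range would of course need to be configurable to fit
the local addressing plan.
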
> > > > > > I don't think that nodes knowing each other (at
> > > > > > node/chassis level) in advance is a big scaling problem.
> > > > > > Thinking about the 10k-node scale, it is just 10k entries on
> > > > > > each node. And node addition/removal is not a very frequent
> > > > > > operation. So does it really matter?
> > > > >
> > > > > If I'm not mistaken, with the full mesh design, scaling to 10k
> > > > > nodes implies 9999 GENEVE ports on every chassis. Can OVS
> > > > > handle that kind of setup? Could this have an impact on
> > > > > datapath performance?
> > > >
> > > > I didn't test this scale myself, but according to the OVS
> > > > documentation [8] (in the LIMITS section) the limit is
> > > > determined by the file descriptor limit only. It is a good point
> > > > regarding datapath performance. In the early days there was
> > > > a statement "Performance will degrade beyond 1,024 ports per
> > > > bridge due to fixed hash table sizing.", but this was removed in
> > > > 2019 [9]. It would be great if someone could share a real test
> > > > result at this scale (it can just be simulated with a large
> > > > enough number of tunnel ports).
> >
> > It would be interesting to get this kind of data. I will check if we
> > run such tests in our labs.
> >
> > > > > > If I understand correctly, the major point of this approach
> > > > > > is to form the Chassis groups for BUM traffic of each L2.
> > > > > > For example, for the MAC learning to work, the initial
> > > > > > broadcast (usually an ARP request) needs to be sent out to
> > > > > > the group of Chassis that is related to that specific
> > > > > > logical L2. However, as also mentioned by Felix and Frode,
> > > > > > requiring multicast support in the infrastructure may
> > > > > > exclude a lot of users.
> > > > >
> > > > > Please excuse my candid question, but was multicast traffic in
> > > > > the fabric ever raised as a problem?
> > > > >
> > > > > Most (all?) top-of-rack switches have IGMP/MLD support
> > > > > built-in. If that was not the case, IPv6 would not work since
> > > > > it requires multicast to function properly.
> > > >
> > > > Having devices supporting IGMP/MLD might still be different from
> > > > being willing to operate multicast. This is not my domain, so
> > > > I would let people with more operator experience comment.
> > > > Regarding IPv6, I think the basic IPv6 operations require
> > > > multicast, but they can use well-defined and static multicast
> > > > addresses that don't require the dynamic group management
> > > > provided by MLD. Anyway, I just wanted to provide an alternative
> > > > option that may have fewer infrastructure requirements.
> > >
> > > One of my concerns for using additional features of switches is
> > > that in most cases you cannot easily fix bugs in them yourself. If
> > > there is some kind of bug in OVN I have the possibility to find
> > > and fix it myself and thereby quickly fix a potential outage. If
> > > I use a feature of a switch and find issues in it, I am most of
> > > the time dependent on some third party that needs to find and fix
> > > the issue and then distribute the fix to me. For the normal
> > > switching and routing features this concern is also valid, but
> > > they are normally extremely widely used. Multicast features are
> > > generally significantly less used, so for me that issue would be
> > > more significant.
> >
> > That is indeed a strong point in favor of doing it all in software.
> >
> > I was expecting IGMP/MLD snooping to be a very basic feature though.
>
> I guess it might be. However, a lot of deployments I have heard of run
> e.g. EVPN in their network underlay. In this case you not only need
> IGMP/MLD support, but IGMP/MLD in combination with EVPN.

OK, I was not aware of such deployments. Thanks!

> > > > > > On the other hand, I would propose something else that can
> > > > > > achieve the same with less cost on the central SB. We can
> > > > > > still let Chassis join "multicast groups" but instead of
> > > > > > relying on IP multicast, we can populate this information to
> > > > > > the SB. It is different from today's LSP-Chassis mapping
> > > > > > (port_binding) in SB, but a more coarse-grained mapping of
> > > > > > Datapath-Chassis, which is sufficient to support the BUM
> > > > > > traffic for the distributed MAC learning purpose and
> > > > > > (relatively) lightweight for the central SB.
> > > > >
> > > > > But it would require the chassis to clone broadcast traffic
> > > > > explicitly in OVS. The main benefit of using IP multicast is
> > > > > that BUM traffic duplication is handled by the fabric
> > > > > switches.
> > > >
> > > > Agreed.
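
(In OpenFlow terms, the difference I was pointing at is roughly the
following; table number, metadata value and port names are made up:)

  # SB-driven chassis groups: N copies leave the node, one per remote
  # chassis that has ports on this datapath
  table=32, metadata=0x5, dl_dst=ff:ff:ff:ff:ff:ff,
    actions=output:ovn-chassis-2,output:ovn-chassis-3,output:ovn-chassis-4
  # IP multicast: a single copy leaves the node and the fabric
  # replicates it to all group members
  table=32, metadata=0x5, dl_dst=ff:ff:ff:ff:ff:ff,
    actions=output:vxlan-mcast
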
> > > > > > In addition, if you still need L3 distributed routing, each
> > > > > > node not only has to join the L2 groups that have workloads
> > > > > > running locally, but also needs to join indirectly connected
> > > > > > L2 groups (e.g. LS1 - LR - LS2) to receive broadcasts and
> > > > > > perform MAC learning for L3-connected remotes. The "states"
> > > > > > learned by each chassis should be no different from the ones
> > > > > > achieved by conditional monitoring (ovn-monitor-all=false).
> > > > > >
> > > > > > Overall, for the above two points, the primary goal is to
> > > > > > reduce dependence on the centralized control plane
> > > > > > (especially the SB DB). I think it may be worth a prototype
> > > > > > (not a small change) for special use cases that require
> > > > > > extremely large scale but simpler features (and without
> > > > > > a big concern about L2 flooding) for a good tradeoff.
> > > > > >
> > > > > > I'd also like to point out that the L2-related scale issue
> > > > > > is more relevant to OpenStack; it is not a problem for
> > > > > > kubernetes (at least not for ovn-kubernetes).
> > > > > > ovn-kubernetes solves the problem by using L3 routing
> > > > > > instead of L2. L2 is confined within each node, and between
> > > > > > the nodes there are only routes exchanged (through the SB
> > > > > > DB), which is O(N) (N = nodes) instead of O(P) (P = ports).
> > > > > > This is discussed in "Trade IP mobility for scalability"
> > > > > > (pages 4-13) of my presentation at OVSCON 2021 [7].
> > > > > >
> > > > > > Also remember that all the other features still require
> > > > > > a centralized DB, including L3 routing, NAT, ACL, LB, and so
> > > > > > on. SB DB optimizations (such as using relays) may still be
> > > > > > required when scaling to 10k nodes.
> > > > >
> > > > > I agree that this is not a small change :) And for it to be
> > > > > worth it, it would probably need to go along with removing the
> > > > > southbound DB to push the decentralization further.
> > > >
> > > > I'd rather consider them separate prototypes, because they are
> > > > somewhat independent changes.
> > > >
> > > > > > > Connect ovn-controller to the northbound DB
> > > > > > > ===========================================
> > > > > > >
> > > > > > > This idea extends a previous proposal to move logical flow
> > > > > > > creation into ovn-controller [3].
> > > > > > >
> > > > > > > [3] https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/
> > > > > > >
> > > > > > > If the first two proposals are implemented, the southbound
> > > > > > > database can be removed from the picture. ovn-controller
> > > > > > > can directly translate the northbound schema into OVS
> > > > > > > configuration: bridges, ports and flow rules.
> > > >
> > > > I forgot to mention in my earlier reply that I don't think "If
> > > > the first two proposals are implemented" matters here.
> > > > Firstly, the first two proposals are primarily for L2
> > > > distribution, but that is a small part (both in code base and
> > > > features) of OVN. Most other features still rely on the central
> > > > DB.
> > > > Secondly, even without the first two proposals, it is still
> > > > a valid attempt to remove the SB (primarily removing the logical
> > > > flow layer). The L2 OVS flows, together with all the other flows
> > > > for other features, can still be generated by ovn-controller
> > > > according to the central DB (probably a combined DB of the
> > > > current NB and SB).
> >
> > OK, that makes sense.
> >
> > Thanks folks!
> >
> > > > > > > For other components that require access to the southbound
> > > > > > > DB (e.g. the neutron metadata agent), ovn-controller
> > > > > > > should provide an interface to expose state and
> > > > > > > configuration data for local consumption.
> > > > > > >
> > > > > > > All state information present in the NB DB should be moved
> > > > > > > to a separate state database [4] for CMS consumption.
> > > > > > >
> > > > > > > [4] https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html
> > > > > > >
> > > > > > > For those who like visuals, I have started working on
> > > > > > > basic use cases and how they would be implemented without
> > > > > > > a southbound database [5].
> > > > > > >
> > > > > > > [5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
> > > > > > >
> > > > > > > Pros:
> > > > > > >
> > > > > > > - The northbound DB is smaller by design: reduced network
> > > > > > >   bandwidth and memory usage in all chassis.
> > > > > > > - If we keep the northbound read-only for ovn-controller,
> > > > > > >   it removes scaling issues when one controller updates
> > > > > > >   one row that needs to be replicated everywhere.
> > > > > > > - The northbound schema knows nothing about flows. We
> > > > > > >   could introduce alternative dataplane backends
> > > > > > >   configured by ovn-controller via plugins. I have done
> > > > > > >   a minimal PoC to check if it could work with the linux
> > > > > > >   network stack [6].
> > > > > > >
> > > > > > > [6] https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go
> > > > > > >
> > > > > > > Cons:
> > > > > > >
> > > > > > > - This would be a serious API breakage for systems that
> > > > > > >   depend on the southbound DB.
> > > > > > > - Can all OVN constructs be implemented without
> > > > > > >   a southbound DB?
> > > > > > > - Is the community interested in alternative datapaths?
> > > > > >
> > > > > > This idea was also discussed briefly in [7] (pages 16-17).
> > > > > > The main motivation was to avoid the cost of the
> > > > > > intermediate logical flow layer. The above-mentioned patch
> > > > > > was abandoned because it kept the logical flow translation
> > > > > > layer, just moved from northd to ovn-controller. The major
> > > > > > benefit of the logical flow layer in northd is that it
> > > > > > performs common calculations that are required for every (or
> > > > > > a lot of) chassis at once, so that they don't need to be
> > > > > > repeated on the chassis. It is also very helpful for
> > > > > > troubleshooting. However, the logical flow layer itself has
> > > > > > a significant cost.
> > > > > >
> > > > > > There has been a lot of improvement done against that cost,
> > > > > > e.g.:
> > > > > >
> > > > > > - incremental lflow processing in ovn-controller and
> > > > > >   partially in ovn-northd
> > > > > > - offloading node-local flow generation to ovn-controller
> > > > > >   (such as port-security, LB hairpin flows, etc.)
> > > > > > - flow-tagging
> > > > > > - ...
> > > > > >
> > > > > > Since then the motivation to remove northd/SB has reduced,
> > > > > > but it is still a valid alternative (with its pros and
> > > > > > cons). (I believe this change is even bigger than the
> > > > > > distributed MAC learning, but prototypes are always
> > > > > > welcome.)
> > > > >
> > > > > I must be honest here and admit that I would not know where to
> > > > > start prototyping such a change in the OVN code base. This is
> > > > > the reason why I reached out to the community, to see if my
> > > > > ideas (or at least some of them) make sense to others.
> > > >
> > > > It is just a big change, almost rewriting OVN. The major part of
> > > > the code in ovn-northd generates logical flows, and
> > > > a significant part of the code in ovn-controller translates
> > > > logical flows to OVS flows. I would not be surprised if this
> > > > became just a different project. (While it is true that some
> > > > parts of ovn-controller could still be reused, such as
> > > > physical.c, chassis.c, encap.c, etc.)
> > > >
> > > > Thanks,
> > > > Han
> > > >
> > > > [8] http://www.openvswitch.org/support/dist-docs/ovs-vswitchd.8.txt
> > > > [9] https://github.com/openvswitch/ovs/commit/4224b9cf8fdba23fa35c1894eae42fd953a3780b
> > > >
> > > > > > Regarding the alternative datapath, I personally don't think
> > > > > > it is a strong argument here. OVN with its NB schema alone
> > > > > > (and OVSDB itself) is not an obvious advantage compared with
> > > > > > other SDN solutions. OVN exists primarily to program OVS (or
> > > > > > any OpenFlow based datapath). The logical flow table (and
> > > > > > other SB data) exists primarily for this purpose. If someone
> > > > > > finds another datapath more attractive for their use cases
> > > > > > than OVS/OpenFlow, it is probably better to switch to its
> > > > > > own control plane (probably using a more popular/scalable
> > > > > > database with their own schema).
> > > > > >
> > > > > > Best regards,
> > > > > > Han
> > > > > >
> > > > > > [7] https://www.openvswitch.org/support/ovscon2021/slides/scale_ovn.pdf
> > > > >
> > > > > This is a fair point and I understand your position.

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss