Hi Felix,

Felix Huettner, Oct 04, 2023 at 09:24:
> Hi Robin,
>
> I'll try to answer what I can.
>
> On Tue, Oct 03, 2023 at 09:22:53AM +0200, Robin Jarry via discuss wrote:
> > Hi all,
> >
> > Felix Huettner, Oct 02, 2023 at 09:35:
> > > Hi everyone,
> > >
> > > just want to add my experience below.
> > >
> > > On Sun, Oct 01, 2023 at 03:49:31PM -0700, Han Zhou wrote:
> > > > On Sun, Oct 1, 2023 at 12:34 PM Robin Jarry <rja...@redhat.com> wrote:
> > > > >
> > > > > Hi Han,
> > > > >
> > > > > Please see my comments/questions inline.
> > > > >
> > > > > Han Zhou, Sep 30, 2023 at 21:59:
> > > > > > > Distributed MAC learning
> > > > > > > ========================
> > > > > > >
> > > > > > > Use one OVS bridge per logical switch with MAC learning
> > > > > > > enabled. Only create the bridge if the logical switch has
> > > > > > > a port bound to the local chassis.
> > > > > > >
> > > > > > > Pros:
> > > > > > >
> > > > > > > - Minimal OpenFlow rules required in each bridge (ACLs and
> > > > > > >   NAT mostly).
> > > > > > > - No central mac binding table required.
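
To make this concrete, the per-logical-switch setup I had in mind would
be something along these lines, relying on the OVS built-in FDB (bridge
and port names are invented for the example):

  # one bridge per logical switch, created only when a port of that
  # switch is bound to the local chassis
  ovs-vsctl add-br ls-net1
  # let OVS handle MAC learning/aging, with a configurable aging time
  ovs-vsctl set bridge ls-net1 other-config:mac-aging-time=300
  # plug local VIFs directly, plus a patch port pair towards br-int
  # for the parts that still need OpenFlow rules (ACLs, NAT, ...)
  ovs-vsctl add-port ls-net1 vif1
  ovs-vsctl add-port ls-net1 patch-net1-int \
      -- set interface patch-net1-int type=patch options:peer=patch-int-net1
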
> > > > > > Firstly, to clarify the terminology of "mac binding" to
> > > > > > avoid confusion, the mac_binding table currently in SB DB has
> > > > > > nothing to do with L2 MAC learning. It is actually the
> > > > > > ARP/Neighbor table of distributed logical routers. We should
> > > > > > probably call it IP_MAC_binding table, or just Neighbor table.
> > > > >
> > > > > Yes, sorry about the confusion. I actually meant the FDB table.
> > > > >
> > > > > > Here what you mean is actually L2 MAC learning, which today
> > > > > > is implemented by the FDB table in SB DB, and it is only for
> > > > > > uncommon use cases when the NB doesn't have the knowledge of
> > > > > > a MAC address of a VIF.
> > > > >
> > > > > This is not that uncommon in telco use cases where VNFs can
> > > > > send packets from mac addresses unknown to OVN.
> > > >
> > > > Understand, but VNFs contribute a very small portion of the
> > > > workloads, right? Maybe I should rephrase that: it is uncommon to
> > > > have "unknown" addresses for the majority of ports in a large
> > > > scale cloud. Is this understanding correct?
> > >
> > > I can only share numbers for our use case: with ~650 chassis, we
> > > have the following distribution of "unknown" in the `addresses`
> > > field of Logical_Switch_Port:
> > >
> > > * 23000 with a mac address + ip and without "unknown"
> > > * 250 with a mac address + ip and with "unknown"
> > > * 30 with just "unknown"
> > >
> > > The use case is a generic public cloud and we do not have any
> > > telco related things.
> >
> > I don't have any numbers from telco deployments at hand but I will
> > poke around.
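
(For readers following the thread: these three categories correspond to
how the LSP addresses are declared in the NB DB. The middle one would be
set with something like the following, MAC and IP made up:)

  # known MAC + IP, but the port may also send from other addresses
  ovn-nbctl lsp-set-addresses lsp1 "fa:16:3e:aa:bb:cc 10.0.0.5" unknown
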
> > > > > > The purpose of this proposal is clear - to avoid using
> > > > > > a central table in DB for L2 information but instead using
> > > > > > L2 MAC learning to populate such information on chassis,
> > > > > > which is a reasonable alternative with pros and cons.
> > > > > > However, I don't think it is necessary to use separate OVS
> > > > > > bridges for this purpose. L2 MAC learning can be easily
> > > > > > implemented in the br-int bridge with OVS flows, which is
> > > > > > much simpler than managing a dynamic number of OVS bridges
> > > > > > just for the purpose of using the built-in OVS mac-learning.
> > > > >
> > > > > I agree that this could also be implemented with VLAN tags on
> > > > > the appropriate ports. But since OVS does not support trunk
> > > > > ports, it may require complicated OF pipelines. My intent with
> > > > > this idea was twofold:
> > > > >
> > > > > 1) Avoid a central point of failure for mac learning/aging.
> > > > > 2) Simplify the OF pipeline by making all FDB operations
> > > > >    dynamic.
> > > >
> > > > IMHO, the L2 pipeline is not really complex. It is probably the
> > > > simplest part (compared with other features for L3, NAT, ACL,
> > > > LB, etc.). Adding dynamic learning to this part probably makes
> > > > it *a little* more complex, but should still be straightforward.
> > > > We don't need any VLAN tag because the incoming packet has the
> > > > geneve VNI in the metadata. We just need a flow that resubmits
> > > > to look up a MAC-tunnelSrc mapping table, and injects a new flow
> > > > (with the related tunnel endpoint information) if the src MAC is
> > > > not found, with the help of the "learn" action. The entries are
> > > > per-logical_switch (VNI). This would serve your purpose of
> > > > avoiding a central DB for L2. At least this looks much simpler
> > > > to me than managing a dynamic number of OVS bridges and the
> > > > patch pairs between them.
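
If I got the idea right, it would look roughly like the sketch below
(untested; table numbers, the timeout and the tunnel port name are
invented):

  # Packet arriving over a tunnel: remember that its source MAC sits
  # behind the sending tunnel endpoint, keyed by VNI (tun_id), then
  # continue with local delivery (not shown).
  table=0, priority=100, in_port=tun0,
    actions=learn(table=10, hard_timeout=300,
                  NXM_NX_TUN_ID[],
                  NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],
                  load:NXM_NX_TUN_IPV4_SRC[]->NXM_NX_TUN_IPV4_DST[]),
            resubmit(,20)
  # Packet from a local VIF (assuming tun_id was set earlier from the
  # VIF's logical switch): the learned flow, if any, sets the tunnel
  # destination before output. A miss leaves tun_dst at 0 and would
  # need a fallback flood to all endpoints of that logical switch.
  table=0, priority=0, actions=resubmit(,10), output:tun0
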
> > Would that work for non-GENEVE networks (localnet) when there is no
> > VNI? Does that apply as well?
> >
> > > > > > Now back to the distributed MAC learning idea itself.
> > > > > > Essentially for two VMs/pods to communicate on L2, say,
> > > > > > VM1@Chassis1 needs to send a packet to VM2@chassis2, assuming
> > > > > > VM1 already has VM2's MAC address (we will discuss this
> > > > > > later), Chassis1 needs to know that VM2's MAC is located on
> > > > > > Chassis2.
> > > > > >
> > > > > > In OVN today this information is conveyed through:
> > > > > >
> > > > > > - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> > > > > > - LSP and Chassis mapping (Chassis -> SB -> Chassis)
> > > > > >
> > > > > > In your proposal:
> > > > > >
> > > > > > - MAC and Chassis mapping (can be learned through initial L2
> > > > > >   broadcast/flood)
> > > > > >
> > > > > > This indeed would avoid the control plane cost through the
> > > > > > centralized components (for this L2 binding part). Given that
> > > > > > today's SB OVSDB is a bottleneck, this idea may sound
> > > > > > attractive. But please also take into consideration the below
> > > > > > improvements that could mitigate the OVN central scale issue:
> > > > > >
> > > > > > - For MAC and LSP mapping, northd is now capable of
> > > > > >   incrementally processing VIF related L2/L3 changes, so the
> > > > > >   cost of NB -> northd -> SB is very small. For SB ->
> > > > > >   Chassis, a more scalable DB deployment, such as the OVSDB
> > > > > >   relays, may largely help.
> > > > >
> > > > > But using relays will only help with read-only operations
> > > > > (SB -> chassis). Write operations (from dynamically learned mac
> > > > > addresses) will be equivalent.
> > > >
> > > > OVSDB relay supports write operations, too. It scales better
> > > > because each ovsdb-server process handles a smaller number of
> > > > clients/connections. It may still perform worse when there are
> > > > too many write operations from many clients, but I think it
> > > > should scale better than without a relay. This is only based on
> > > > my knowledge of the ovsdb-server relay, but I haven't tested it
> > > > at scale, yet. People who actually deployed it may comment more.
> > >
> > > From our experience I would agree with that. I think removing the
> > > large amount of updates that need to be sent out after a write is
> > > the most helpful thing here. If you are, however, limited by raw
> > > write throughput and already hit that without any readers, then
> > > I guess the only options are to make this decentralized or to
> > > improve OVSDB.
> >
> > OK thanks.
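
As a side note for the archives: if I read the OVSDB documentation
correctly, a southbound relay deployment would look something like this
(addresses and ports invented):

  # on a relay node: serve a replica of the SB DB, forwarding writes
  # to the actual cluster
  ovsdb-server --remote=ptcp:6642 relay:OVN_Southbound:tcp:10.0.0.1:6642
  # on each chassis: point ovn-controller at the relay instead of the
  # central cluster
  ovs-vsctl set open_vswitch . external_ids:ovn-remote=tcp:10.0.0.2:6642
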
> > > > > > - For LSP and Chassis mapping, the round trip through
> > > > > >   a central DB obviously costs more than a direct L2
> > > > > >   broadcast (the targets are the same). But this can be
> > > > > >   optimized if the MAC and Chassis are known by the CMS
> > > > > >   (which is true for most OpenStack/k8s environments,
> > > > > >   I believe). Instead of updating the binding from each
> > > > > >   Chassis, the CMS can tell this information through the same
> > > > > >   NB -> northd -> SB -> Chassis path, and the Chassis can
> > > > > >   just read the SB without updating it.
> > > > >
> > > > > This is only applicable for known L2 addresses. Maybe telco
> > > > > use cases are very specific, but being able to have ports that
> > > > > send packets from unknown addresses is a strong requirement.
> > > >
> > > > Understand. But in terms of scale, the assumption is that the
> > > > majority of ports' addresses are known by the CMS. Is this
> > > > assumption correct in telco use cases?
> >
> > I will reach out to our field teams to try and get an answer to this
> > question.
> >
> > > > > > On the other hand, the dynamic MAC learning approach has its
> > > > > > own drawbacks.
> > > > > >
> > > > > > - It is simple to consider L2 only, but if considering more
> > > > > >   SDN features, a central DB is more flexible to extend and
> > > > > >   implement new features than a network protocol based
> > > > > >   approach.
> > > > > > - It is more predictable and easier to debug with information
> > > > > >   pre-populated through the CMS than with states learned
> > > > > >   dynamically in the data plane.
> > > > > > - With the DB approach we can suppress most L2
> > > > > >   broadcast/flood, while with distributed MAC learning
> > > > > >   broadcast/flood can't be avoided. Although it may happen
> > > > > >   mostly when a new workload is launched, it can also happen
> > > > > >   on aging. The cost of broadcast in a large L2 domain is
> > > > > >   also a potential threat to scale.
> > > > >
> > > > > I may lack the field experience of operating large datacenter
> > > > > networks but I was not aware of any scaling issues because of
> > > > > ARP and/or other L2 broadcasts. Is this an actual problem that
> > > > > was reported by cloud/telco operators and which influenced the
> > > > > centralized design decisions?
> > > >
> > > > I didn't hear any cloud operator reporting such a problem, but
> > > > I did hear people express their concerns about it in many
> > > > situations. And if you google "ARP suppression" there are lots
> > > > of implementations by different vendors. And I believe it is
> > > > a real problem if not well managed, e.g. using an extremely
> > > > large L2 domain without ARP suppression. But I also believe it
> > > > shouldn't be a big concern if L2 segments are small.
> >
> > I was not aware that it could be a real issue at scale. I was
> > expecting ARP traffic to be completely negligible.
> >
> > > I think for large L2 domains having ARP requests broadcast is
> > > purely a question of network bandwidth. If they get dropped before
> > > they reach the destination they can just be retried, and in normal
> > > communication the ARP cache will not expire anyway.
> > >
> > > However this is different if you have some kind of L2 based
> > > failover, e.g. if you move a mac address between hosts and then
> > > send out GARPs. In this case you must be certain that these GARPs
> > > have reached every single switch for it to update its FIB. Without
> > > a central state this is impossible to guarantee since packets
> > > might be randomly dropped. With a central state you need just one
> > > node to pick up the information and write it to some central store
> > > (just like that works for virtual Port_Bindings iirc).
> > >
> > > Note that this would be just a benefit of the central store, which
> > > also has drawbacks that you already mentioned.
> >
> > Are you thinking about MLAG?
>
> No, the central store is what we currently have with e.g. the
> Port_Binding table in the southbound db (at least from my
> understanding).

I was referring to the "L2 based failover" example you mentioned. Or
maybe you were talking about VRRP?

> > > > > > > Use multicast for overlay networks
> > > > > > > ==================================
> > > > > > >
> > > > > > > Use a unique 24-bit VNI per overlay network. Derive
> > > > > > > a multicast group address from that VNI. Use VXLAN address
> > > > > > > learning [2] to remove the need for ovn-controller to know
> > > > > > > the destination chassis for every mac address in advance.
> > > > > > >
> > > > > > > [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
> > > > > > >
> > > > > > > Pros:
> > > > > > >
> > > > > > > - Nodes do not need to know about others in advance. The
> > > > > > >   control plane load is distributed across the cluster.
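
To make the derivation part concrete: what I had in mind was simply
embedding the 24-bit VNI into the 239.0.0.0/8 organization-local scope
and letting the kernel VTEP handle the flooding and learning, e.g.:

  # VNI 4097 = 0x001001 -> group 239.0.16.1 (239.<hi>.<mid>.<lo>)
  ip link add vxlan4097 type vxlan id 4097 group 239.0.16.1 \
      dstport 4789 dev eth0
  # unknown/broadcast destinations are flooded to the group; source
  # MAC to VTEP mappings are learned from received packets as per
  # RFC 7348 section 4.2

The exact group range would of course need to be configurable to fit
the local addressing plan.
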
> > > > > > I don't think that nodes knowing each other (at
> > > > > > node/chassis level) in advance is a big scaling problem.
> > > > > > Thinking about the 10k-node scale, it is just 10k entries on
> > > > > > each node. And node addition/removal is not a very frequent
> > > > > > operation. So does it really matter?
> > > > >
> > > > > If I'm not mistaken, with the full mesh design, scaling to 10k
> > > > > nodes implies 9999 GENEVE ports on every chassis. Can OVS
> > > > > handle that kind of setup? Could this have an impact on
> > > > > datapath performance?
> > > >
> > > > I didn't test this scale myself, but according to the OVS
> > > > documentation [8] (in the LIMITS section) the limit is
> > > > determined by the file descriptor limit only. It is a good point
> > > > regarding datapath performance. In the early days there was
> > > > a statement "Performance will degrade beyond 1,024 ports per
> > > > bridge due to fixed hash table sizing.", but this was removed in
> > > > 2019 [9]. It would be great if someone could share a real test
> > > > result at this scale (it can just be simulated with a large
> > > > enough number of tunnel ports).
> >
> > It would be interesting to get this kind of data. I will check if we
> > run such tests in our labs.
> >
> > > > > > If I understand correctly, the major point of this approach
> > > > > > is to form the Chassis groups for BUM traffic of each L2.
> > > > > > For example, for the MAC learning to work, the initial
> > > > > > broadcast (usually an ARP request) needs to be sent out to
> > > > > > the group of Chassis that is related to that specific
> > > > > > logical L2. However, as also mentioned by Felix and Frode,
> > > > > > requiring multicast support in the infrastructure may
> > > > > > exclude a lot of users.
> > > > >
> > > > > Please excuse my candid question, but was multicast traffic in
> > > > > the fabric ever raised as a problem?
> > > > >
> > > > > Most (all?) top-of-rack switches have IGMP/MLD support
> > > > > built-in. If that was not the case, IPv6 would not work since
> > > > > it requires multicast to function properly.
> > > >
> > > > Having devices supporting IGMP/MLD might still be different from
> > > > being willing to operate multicast. This is not my domain, so
> > > > I would let people with more operator experience comment.
> > > > Regarding IPv6, I think the basic IPv6 operations require
> > > > multicast, but they can use well-defined and static multicast
> > > > addresses that don't require the dynamic group management
> > > > provided by MLD. Anyway, I just wanted to provide an alternative
> > > > option that may have fewer infrastructure requirements.
> > >
> > > One of my concerns for using additional features of switches is
> > > that in most cases you cannot easily fix bugs in them yourself. If
> > > there is some kind of bug in OVN I have the possibility to find
> > > and fix it myself and thereby quickly fix a potential outage. If
> > > I use a feature of a switch and find issues in it, I am most of
> > > the time dependent on some third party that needs to find and fix
> > > the issue and then distribute the fix to me. For the normal
> > > switching and routing features this concern is also valid, but
> > > they are normally extremely widely used. Multicast features are
> > > generally significantly less used, so for me that issue would be
> > > more significant.
> >
> > That is indeed a strong point in favor of doing it all in software.
> >
> > I was expecting IGMP/MLD snooping to be a very basic feature though.
>
> I guess it might be. However, a lot of deployments I have heard of run
> e.g. EVPN in their network underlay. In this case you not only need
> IGMP/MLD support, but IGMP/MLD in combination with EVPN.

OK, I was not aware of such deployments. Thanks!

> > > > > > On the other hand, I would propose something else that can
> > > > > > achieve the same with less cost on the central SB. We can
> > > > > > still let Chassis join "multicast groups" but instead of
> > > > > > relying on IP multicast, we can populate this information to
> > > > > > the SB. It is different from today's LSP-Chassis mapping
> > > > > > (port_binding) in SB, but a more coarse-grained mapping of
> > > > > > Datapath-Chassis, which is sufficient to support the BUM
> > > > > > traffic for the distributed MAC learning purpose and
> > > > > > (relatively) lightweight for the central SB.
> > > > >
> > > > > But it would require the chassis to clone broadcast traffic
> > > > > explicitly in OVS. The main benefit of using IP multicast is
> > > > > that BUM traffic duplication is handled by the fabric
> > > > > switches.
> > > >
> > > > Agreed.
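
(In OpenFlow terms, the difference I was pointing at is roughly the
following; table number, metadata value and port names are made up:)

  # SB-driven chassis groups: N copies leave the node, one per remote
  # chassis that has ports on this datapath
  table=32, metadata=0x5, dl_dst=ff:ff:ff:ff:ff:ff,
    actions=output:ovn-chassis-2,output:ovn-chassis-3,output:ovn-chassis-4
  # IP multicast: a single copy leaves the node and the fabric
  # replicates it to all group members
  table=32, metadata=0x5, dl_dst=ff:ff:ff:ff:ff:ff,
    actions=output:vxlan-mcast
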
> > > > > > In addition, if you still need L3 distributed routing, each
> > > > > > node not only has to join the L2 groups that have workloads
> > > > > > running locally, but also needs to join indirectly connected
> > > > > > L2 groups (e.g. LS1 - LR - LS2) to receive broadcasts and
> > > > > > perform MAC learning for L3-connected remotes. The "states"
> > > > > > learned by each chassis should be no different from the ones
> > > > > > achieved by conditional monitoring (ovn-monitor-all=false).
> > > > > >
> > > > > > Overall, for the above two points, the primary goal is to
> > > > > > reduce dependence on the centralized control plane
> > > > > > (especially the SB DB). I think it may be worth a prototype
> > > > > > (not a small change) for special use cases that require
> > > > > > extremely large scale but simpler features (and without
> > > > > > a big concern about L2 flooding) for a good tradeoff.
> > > > > >
> > > > > > I'd also like to point out that the L2-related scale issue
> > > > > > is more relevant to OpenStack; it is not a problem for
> > > > > > kubernetes (at least not for ovn-kubernetes).
> > > > > > ovn-kubernetes solves the problem by using L3 routing
> > > > > > instead of L2. L2 is confined within each node, and between
> > > > > > the nodes there are only routes exchanged (through the SB
> > > > > > DB), which is O(N) (N = nodes) instead of O(P) (P = ports).
> > > > > > This is discussed in "Trade IP mobility for scalability"
> > > > > > (pages 4-13) of my presentation at OVSCON 2021 [7].
> > > > > >
> > > > > > Also remember that all the other features still require
> > > > > > a centralized DB, including L3 routing, NAT, ACL, LB, and so
> > > > > > on. SB DB optimizations (such as using relays) may still be
> > > > > > required when scaling to 10k nodes.
> > > > >
> > > > > I agree that this is not a small change :) And for it to be
> > > > > worth it, it would probably need to go along with removing the
> > > > > southbound DB to push the decentralization further.
> > > >
> > > > I'd rather consider them separate prototypes, because they are
> > > > somewhat independent changes.
> > > >
> > > > > > > Connect ovn-controller to the northbound DB
> > > > > > > ===========================================
> > > > > > >
> > > > > > > This idea extends a previous proposal to move logical flow
> > > > > > > creation into ovn-controller [3].
> > > > > > >
> > > > > > > [3] https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/
> > > > > > >
> > > > > > > If the first two proposals are implemented, the southbound
> > > > > > > database can be removed from the picture. ovn-controller
> > > > > > > can directly translate the northbound schema into OVS
> > > > > > > configuration: bridges, ports and flow rules.
> > > >
> > > > I forgot to mention in my earlier reply that I don't think "If
> > > > the first two proposals are implemented" matters here.
> > > > Firstly, the first two proposals are primarily for L2
> > > > distribution, but that is a small part (both in code base and
> > > > features) of OVN. Most other features still rely on the central
> > > > DB.
> > > > Secondly, even without the first two proposals, it is still
> > > > a valid attempt to remove the SB (primarily removing the logical
> > > > flow layer). The L2 OVS flows, together with all the other flows
> > > > for other features, can still be generated by ovn-controller
> > > > according to the central DB (probably a combined DB of the
> > > > current NB and SB).
> >
> > OK, that makes sense.
> >
> > Thanks folks!
> >
> > > > > > > For other components that require access to the southbound
> > > > > > > DB (e.g. the neutron metadata agent), ovn-controller
> > > > > > > should provide an interface to expose state and
> > > > > > > configuration data for local consumption.
> > > > > > >
> > > > > > > All state information present in the NB DB should be moved
> > > > > > > to a separate state database [4] for CMS consumption.
> > > > > > >
> > > > > > > [4] https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html
> > > > > > >
> > > > > > > For those who like visuals, I have started working on
> > > > > > > basic use cases and how they would be implemented without
> > > > > > > a southbound database [5].
> > > > > > >
> > > > > > > [5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
> > > > > > >
> > > > > > > Pros:
> > > > > > >
> > > > > > > - The northbound DB is smaller by design: reduced network
> > > > > > >   bandwidth and memory usage in all chassis.
> > > > > > > - If we keep the northbound read-only for ovn-controller,
> > > > > > >   it removes scaling issues when one controller updates
> > > > > > >   one row that needs to be replicated everywhere.
> > > > > > > - The northbound schema knows nothing about flows. We
> > > > > > >   could introduce alternative dataplane backends
> > > > > > >   configured by ovn-controller via plugins. I have done
> > > > > > >   a minimal PoC to check if it could work with the linux
> > > > > > >   network stack [6].
> > > > > > >
> > > > > > > [6] https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go
> > > > > > >
> > > > > > > Cons:
> > > > > > >
> > > > > > > - This would be a serious API breakage for systems that
> > > > > > >   depend on the southbound DB.
> > > > > > > - Can all OVN constructs be implemented without
> > > > > > >   a southbound DB?
> > > > > > > - Is the community interested in alternative datapaths?
> > > > > >
> > > > > > This idea was also discussed briefly in [7] (pages 16-17).
> > > > > > The main motivation was to avoid the cost of the
> > > > > > intermediate logical flow layer. The above-mentioned patch
> > > > > > was abandoned because it kept the logical flow translation
> > > > > > layer, just moved from northd to ovn-controller. The major
> > > > > > benefit of the logical flow layer in northd is that it
> > > > > > performs common calculations that are required for every (or
> > > > > > a lot of) chassis at once, so that they don't need to be
> > > > > > repeated on the chassis. It is also very helpful for
> > > > > > troubleshooting. However, the logical flow layer itself has
> > > > > > a significant cost.
> > > > > >
> > > > > > There has been a lot of improvement done against that cost,
> > > > > > e.g.:
> > > > > >
> > > > > > - incremental lflow processing in ovn-controller and
> > > > > >   partially in ovn-northd
> > > > > > - offloading node-local flow generation to ovn-controller
> > > > > >   (such as port-security, LB hairpin flows, etc.)
> > > > > > - flow-tagging
> > > > > > - ...
> > > > > >
> > > > > > Since then the motivation to remove northd/SB has reduced,
> > > > > > but it is still a valid alternative (with its pros and
> > > > > > cons). (I believe this change is even bigger than the
> > > > > > distributed MAC learning, but prototypes are always
> > > > > > welcome.)
> > > > >
> > > > > I must be honest here and admit that I would not know where to
> > > > > start prototyping such a change in the OVN code base. This is
> > > > > the reason why I reached out to the community, to see if my
> > > > > ideas (or at least some of them) make sense to others.
> > > >
> > > > It is just a big change, almost rewriting OVN. The major part of
> > > > the code in ovn-northd generates logical flows, and
> > > > a significant part of the code in ovn-controller translates
> > > > logical flows to OVS flows. I would not be surprised if this
> > > > became just a different project. (While it is true that some
> > > > parts of ovn-controller could still be reused, such as
> > > > physical.c, chassis.c, encap.c, etc.)
> > > >
> > > > Thanks,
> > > > Han
> > > >
> > > > [8] http://www.openvswitch.org/support/dist-docs/ovs-vswitchd.8.txt
> > > > [9] https://github.com/openvswitch/ovs/commit/4224b9cf8fdba23fa35c1894eae42fd953a3780b
> > > >
> > > > > > Regarding the alternative datapath, I personally don't think
> > > > > > it is a strong argument here. OVN with its NB schema alone
> > > > > > (and OVSDB itself) is not an obvious advantage compared with
> > > > > > other SDN solutions. OVN exists primarily to program OVS (or
> > > > > > any OpenFlow based datapath). The logical flow table (and
> > > > > > other SB data) exists primarily for this purpose. If someone
> > > > > > finds another datapath more attractive for their use cases
> > > > > > than OVS/OpenFlow, it is probably better to switch to its
> > > > > > own control plane (probably using a more popular/scalable
> > > > > > database with their own schema).
> > > > > >
> > > > > > Best regards,
> > > > > > Han
> > > > > >
> > > > > > [7] https://www.openvswitch.org/support/ovscon2021/slides/scale_ovn.pdf
> > > > >
> > > > > This is a fair point and I understand your position.

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss