Hi Han,

Please see my comments/questions inline.

Han Zhou, Sep 30, 2023 at 21:59:
> > Distributed mac learning
> > ========================
> >
> > Use one OVS bridge per logical switch with mac learning enabled. Only
> > create the bridge if the logical switch has a port bound to the local
> > chassis.
> >
> > Pros:
> >
> > - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> > - No central mac binding table required.
>
> Firstly to clarify the terminology of "mac binding" to avoid confusion, the
> mac_binding table currently in SB DB has nothing to do with L2 MAC
> learning. It is actually the ARP/Neighbor table of distributed logical
> routers. We should probably call it IP_MAC_binding table, or just Neighbor
> table.

Yes, sorry about the confusion. I actually meant the FDB table.

> Here what you mean is actually L2 MAC learning, which today is implemented
> by the FDB table in SB DB, and it is only for uncommon use cases when the
> NB doesn't have the knowledge of a MAC address of a VIF.

This is not that uncommon in telco use cases where VNFs can send packets
from MAC addresses unknown to OVN.
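
For reference, this is the kind of configuration I have in mind (the
port name and addresses are made up). If I am not mistaken, the
"unknown" address is also what makes OVN fall back to the SB FDB table
for such ports:

    # VNF port that may send packets from MAC addresses that OVN does
    # not know in advance.
    ovn-nbctl lsp-set-addresses vnf-port-0 "00:00:00:00:10:01 10.0.0.10" unknown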

> The purpose of this proposal is clear - to avoid using a central table in
> DB for L2 information but instead using L2 MAC learning to populate such
> information on chassis, which is a reasonable alternative with pros and
> cons.
> However, I don't think it is necessary to use separate OVS bridges for this
> purpose. L2 MAC learning can be easily implemented in the br-int bridge
> with OVS flows, which is much simpler than managing a dynamic number of OVS
> bridges just for the purpose of using the builtin OVS mac-learning.

I agree that this could also be implemented with VLAN tags on the
appropriate ports. But since OVS does not support trunk ports, it may
require complicated OF pipelines. My intent with this idea was twofold:

1) Avoid a central point of failure for mac learning/aging.
2) Simplify the OF pipeline by making all FDB operations dynamic.
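
For completeness, here is roughly how I picture your suggestion with
the OVS learn() action (a minimal sketch; the bridge name, table
numbers and timeout are arbitrary, and integrating this into the real
br-int pipeline would obviously be more involved):

    # Learn source MAC -> in_port per VLAN (standing in for the logical
    # switch), then resubmit for forwarding.
    ovs-ofctl add-flow br-int "table=0, priority=10, \
        actions=learn(table=10, hard_timeout=300, \
                      NXM_OF_VLAN_TCI[0..11], \
                      NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[], \
                      output:NXM_OF_IN_PORT[]), \
        resubmit(,10)"
    # Unknown destinations fall back to flooding in the L2 domain.
    ovs-ofctl add-flow br-int "table=10, priority=0, actions=flood"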

> Now back to the distributed MAC learning idea itself. Essentially for two
> VMs/pods to communicate on L2, say, VM1@Chassis1 needs to send a packet to
> VM2@Chassis2, assuming VM1 already has VM2's MAC address (we will discuss
> this later), Chassis1 needs to know that VM2's MAC is located on Chassis2.
>
> In OVN today this information is conveyed through:
>
> - MAC and LSP mapping (NB -> northd -> SB -> Chassis)
> - LSP and Chassis mapping (Chassis -> SB -> Chassis)
>
> In your proposal:
>
> - MAC and Chassis mapping (can be learned through initial L2
>   broadcast/flood)
>
> This indeed would avoid the control plane cost through the centralized
> components (for this L2 binding part). Given that today's SB OVSDB is a
> bottleneck, this idea may sound attractive. But please also take into
> consideration the below improvement that could mitigate the OVN central
> scale issue:
>
> - For MAC and LSP mapping, northd is now capable of incrementally
>   processing VIF related L2/L3 changes, so the cost of NB -> northd ->
>   SB is very small. For SB -> Chassis, a more scalable DB deployment,
>   such as the OVSDB relays, may largely help.

But using relays will only help with read-only operations (SB ->
chassis). Write operations (from dynamically learned MAC addresses)
still go through the central SB DB and cost the same as without relays.
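
My understanding of the relay model (the command line is only
illustrative, addresses and ports are made up):

    # A relay keeps an in-memory copy of the SB DB and serves reads to
    # local clients; write transactions are forwarded to the central
    # server, so they do not scale out the way reads do.
    ovsdb-server --remote=ptcp:6642:0.0.0.0 relay:OVN_Southbound:tcp:192.0.2.1:6642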

> - For LSP and Chassis mapping, the round trip through a central DB
>   obviously costs higher than a direct L2 broadcast (the targets are
>   the same). But this can be optimized if the MAC and Chassis is known
>   by the CMS system (which is true for most openstack/k8s env
>   I believe). Instead of updating the binding from each Chassis, CMS
>   can tell this information through the same NB -> northd -> SB ->
>   Chassis path, and the Chassis can just read the SB without updating
>   it.

This only applies to known L2 addresses.  Maybe telco use cases are
very specific, but being able to have ports that send packets from
unknown MAC addresses is a strong requirement.

> On the other hand, the dynamic MAC learning approach has its own drawbacks.
>
> - It is simple to consider L2 only, but if considering more SDN
>   features, a central DB is more flexible to extend and implement new
>   features than a network protocol based approach.
> - It is more predictable and easier to debug with pre-populated
>   information through CMS than states learned dynamically in
>   data-plane.
> - With the DB approach we can suppress most of L2 broadcast/flood,
>   while with the distributed MAC learning broadcast/flood can't be
>   avoided. Although it may happen mostly when a new workload is
>   launched, it can also happen when aging. The cost of broadcast in
>   large L2 is also a potential threat to scale.

I may lack the field experience of operating large datacenter networks,
but I was not aware of any scaling issues caused by ARP and/or other L2
broadcasts.  Is this an actual problem that was reported by cloud/telco
operators and that influenced the decision to centralize?

> > Use multicast for overlay networks
> > ==================================
> >
> > Use a unique 24bit VNI per overlay network. Derive a multicast group
> > address from that VNI. Use VXLAN address learning [2] to remove the need
> > for ovn-controller to know the destination chassis for every mac address
> > in advance.
> >
> > [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
> >
> > Pros:
> >
> > - Nodes do not need to know about others in advance. The control plane
> >   load is distributed across the cluster.
>
> I don't think that nodes knowing each other (at node/chassis level) in
> advance is a big scaling problem. Thinking about the 10k nodes scale, it is
> just 10k entries on each node. And node addition/removal is not a very
> frequent operation. So does it really matter?

If I'm not mistaken, with the full mesh design, scaling to 10k nodes
implies 9999 GENEVE ports on every chassis.  Can OVS handle that kind of
setup?  Could this have an impact on datapath performance?
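
To illustrate what I had in mind, outside of OVS and using the kernel
VXLAN device as an analogy (names and addresses are made up):

    # One VXLAN device per logical switch, no per-peer configuration:
    # BUM frames are sent to the multicast group and remote VTEPs are
    # learned from the outer source IP of incoming frames (RFC 7348,
    # section 4.2).
    ip link add vxlan100 type vxlan id 100 group 239.1.1.100 dstport 4789 dev eth0
    ip link set vxlan100 up
    bridge fdb show dev vxlan100   # learned MAC -> remote VTEP entries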

> If I understand correctly, the major point of this approach is to form the
> Chassis groups for BUM traffic of each L2. For example, for the MAC
> learning to work, the initial broadcast (usually ARP request) needs to be
> sent out to the group of Chassis that is related to that specific logical
> L2. However, as also mentioned by Felix and Frode, requiring multicast
> support in infrastructure may exclude a lot of users.

Please excuse my candid question, but was multicast traffic in the
fabric ever raised as a problem?

Most (all?) top-of-rack switches have IGMP/MLD support built in. If
that were not the case, IPv6 would not work, since it requires
multicast to function properly.

> On the other hand, I would propose something else that can achieve the same
> with less cost on the central SB. We can still let Chassis join "multicast
> groups" but instead of relying on IP mulitcast, we can populate this
> information to SB. It is different from today's LSP-Chassis mapping
> (port_binding) in SB, but a more coarse-grained mapping of
> Datapath-Chassis, which is sufficient to support the BUM traffic for the
> distributed MAC learning purpose and lightweight (relatively) to the
> central SB.

But it would require each chassis to clone broadcast traffic explicitly
in OVS.  The main benefit of using IP multicast is that BUM traffic
replication is handled by the fabric switches.
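
To make this concrete, the flood flow on each chassis would have to
enumerate every remote tunnel, something like (port numbers are
hypothetical, and unknown unicast/multicast would need the same
treatment):

    # Explicit cloning of broadcast frames to N-1 tunnel ports.
    ovs-ofctl add-flow br-int "priority=10, dl_dst=ff:ff:ff:ff:ff:ff, actions=output:10,output:11,output:12"

With a multicast-capable underlay, a single output to one tunnel port
would be enough and the replication would happen in the fabric.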

> In addition, if you still need L3 distributed routing, each node not only
> has to join the L2 groups that have workloads running locally, but also needs
> to join indirectly connected L2 groups (e.g. LS1 - LR - LS2) to receive
> broadcast to perform MAC learning for L3 connected remotes. The "states"
> learned by each chassis should be no different than the one achieved by
> conditional monitoring (ovn-monitor-all=false).
>
> Overall, for the above two points, the primary goal is to reduce dependence
> on the centralized control plane (especially SB DB). I think it may be
> worth some prototype (not a small change) for special use cases that
> require extremely large scale but simpler features (and without a big
> concern of L2 flooding) for a good tradeoff.
>
> I'd also like to remind that the L2 related scale issue is more relevant to
> OpenStack, but it is not a problem for kubernetes (at least not for
> ovn-kubernetes). ovn-kubernetes solves the problem by using L3 routing
> instead of L2. L2 is confined within each node, and between the nodes there
> are only routes exchanged (through SB DB), which is O(N) (N = nodes)
> instead of O(P) (P = ports). This is discussed in "Trade IP mobility for
> scalability" (page 4 - 13) of my presentation in OVSCON2021 [7].
>
> Also remember that all the other features still require centralized DB,
> including L3 routing, NAT, ACL, LB, and so on. SB DB optimizations (such as
> using relay) may still be required when scaling to 10k nodes.

I agree that this is not a small change :) And for it to be worth it, it
would probably need to go along with removing the southbound DB to push
the decentralization further.

> > Connect ovn-controller to the northbound DB
> > ===========================================
> >
> > This idea extends on a previous proposal to migrate the logical flows
> > creation in ovn-controller [3].
> >
> > [3] 
> > https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-num...@ovn.org/
> >
> > If the first two proposals are implemented, the southbound database can
> > be removed from the picture. ovn-controller can directly translate the
> > northbound schema into OVS configuration bridges, ports and flow rules.
> >
> > For other components that require access to the southbound DB (e.g.
> > neutron metadata agent), ovn-controller should provide an interface to
> > expose state and configuration data for local consumption.
> >
> > All state information present in the NB DB should be moved to a separate
> > state database [4] for CMS consumption.
> >
> > [4] https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html
> >
> > For those who like visuals, I have started working on basic use cases
> > and how they would be implemented without a southbound database [5].
> >
> > [5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
> >
> > Pros:
> >
> > - The northbound DB is smaller by design: reduced network bandwidth and
> >   memory usage in all chassis.
> > - If we keep the northbound read-only for ovn-controller, it removes
> >   scaling issues when one controller updates one row that needs to be
> >   replicated everywhere.
> > - The northbound schema knows nothing about flows. We could introduce
> >   alternative dataplane backends configured by ovn-controller via
> >   plugins. I have done a minimal PoC to check if it could work with the
> >   linux network stack [6].
> >
> > [6] https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go
> >
> > Cons:
> >
> > - This would be a serious API breakage for systems that depend on the
> >   southbound DB.
> > - Can all OVN constructs be implemented without a southbound DB?
> > - Is the community interested in alternative datapaths?
> >
>
> This idea was also discussed briefly in [7] (page 16-17). The main
> motivation was to avoid the cost of the intermediate logical flow layer.
> The above mentioned patch was abandoned because it still has the logical
> flow translation layer but just moved from northd to ovn-controller.
> The major benefit of logical flow layer in northd is that it performs
> common calculations that is required for every (or a lot of) chassis at
> once, so that they don't need to be repeated on the chassis. It is also
> very helpful for trouble-shooting. However, the logical flow layer itself
> has a significant cost.
>
> There has been lots of improvement done against the cost, e.g.:
>
> - incremental lflow processing in ovn-controller and partially in ovn-northd
> - offloading node-local flow generation in ovn-controller (such as
>   port-security, LB hairpin flows, etc.)
> - flow-tagging
> - ...
>
> Since then the motivation to remove north/SB has reduced, but it is still a
> valid alternative (with its pros and cons).
> (I believe this change is even bigger than the distributed MAC learning,
> but prototypes are always welcome)

I must be honest here and admit that I would not know where to start for
prototyping such a change in the OVN code base. This is the reason why
I reached out to the community to see if my ideas (or at least some of
them) make sense to others.

> Regarding the alternative datapath, I personally don't think it is a strong
> argument here. OVN with its NB schema alone (and OVSDB itself) is not an
> obvious advantage compared with other SDN solutions. OVN exists primarily
> to program OVS (or any OpenFlow based datapath). Logical flow table (and
> other SB data) exists primarily for this purpose. If someone finds another
> datapath is more attractive to their use cases than OVS/OpenFlow, it is
> probably better to switch to its own control plane (probably using a more
> popular/scalable database with their own schema).
>
> Best regards,
> Han
>
> [7] https://www.openvswitch.org/support/ovscon2021/slides/scale_ovn.pdf

This is a fair point and I understand your position.

Thanks for taking the time to consider my ideas!
