Felix Huettner, Sep 29, 2023 at 15:23:
> > Distributed mac learning
> > ========================
[snip]
> >
> > Cons:
> >
> > - How to manage seamless upgrades?
> > - Requires ovn-controller to move/plug ports in the correct bridge.
> > - Multiple openflow connections (one per managed bridge).
> > - Requires ovn-trace to be reimplemented differently (maybe other tools
> >   as well).
>
> - No central information on mac bindings anymore. All nodes need to
>   update their data individually.
> - Each bridge also generates a Linux network interface. I do not know if
>   there is some kind of limit on the number of Linux interfaces or OVS
>   bridges somewhere.

That's a good point. However, a chassis would only need to create the
bridges for the logical networks it actually implements. Even in the
largest OVN deployments, I doubt this would be a limitation.
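
To put a rough number on the per-bridge cost: each OVS bridge exposes
essentially one extra kernel netdev (its local internal port), so N locally
implemented logical switches would mean roughly N additional interfaces.
A minimal illustration, with a hypothetical bridge name:

```
# Each OVS bridge creates an internal port with the same name, visible as
# a regular Linux network interface.
ovs-vsctl add-br br-ls1
ip link show br-ls1
```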

> Would you still preprovision static mac addresses on the bridge for all
> port_bindings we know the mac address from, or would you rather leave
> that up for learning as well?

I would leave everything dynamic.
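
To make "dynamic" concrete: a bridge running the standard NORMAL pipeline
learns source macs by itself, and the learned entries can be inspected with
the usual FDB commands (bridge name hypothetical):

```
# Show the dynamically learned MAC table of a per-logical-switch bridge.
ovs-appctl fdb/show br-ls1
```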

> I do not know if there is some kind of performance/optimization penalty
> for moving packets between different bridges.

As far as I know, once the openflow pipeline has been resolved into
a datapath flow, there is no penalty.
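
As an illustration, if the per-logical-switch bridges were connected with
patch ports, a packet crossing several bridges is still handled by a single
megaflow once the upcall has been processed (bridge and port names are
hypothetical):

```
# Connect two bridges with a patch port pair.
ovs-vsctl add-port br-a patch-a-b \
    -- set interface patch-a-b type=patch options:peer=patch-b-a
ovs-vsctl add-port br-b patch-b-a \
    -- set interface patch-b-a type=patch options:peer=patch-a-b
# Traffic traversing both bridges appears as one datapath flow.
ovs-appctl dpctl/dump-flows
```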

> You also cannot use only the logical switches that have a local port
> bound. Assume the following topology:
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> vm1 and vm2 are both running on the same hypervisor. Creating only local
> logical switches would mean only ls1 and ls3 are available on that
> hypervisor. This would break the connection between the two vms, which
> in the current implementation would just traverse the two logical
> routers.
> I guess we would need to create bridges for each locally reachable
> logical switch. I am concerned about the potentially significant
> increase in bridges and openflow connections this brings.

That is one of the concerns I raised in the last point. In my opinion this
is a trade-off: you remove centralization and require more local
processing, but overall the processing cost should remain equivalent.

> > Use multicast for overlay networks
> > ==================================
[snip]
> > - 24bit VNI allows for more than 16 million logical switches. No need
> >   for extended GENEVE tunnel options.
> Note that using vxlan at the moment significantly reduces the ovn
> featureset. This is because the geneve header options are currently used
> for data that would not fit into the vxlan vni.
>
> From ovn-architecture.7.xml:
> ```
> The maximum number of networks is reduced to 4096.
> The maximum number of ports per network is reduced to 2048.
> ACLs matching against logical ingress port identifiers are not supported.
> OVN interconnection feature is not supported.
> ```

In my understanding, the main reason why GENEVE replaced VXLAN is that
OpenStack uses full-mesh, point-to-point tunnels and the sender needs to
know behind which chassis each mac address sits in order to send traffic
into the correct tunnel. GENEVE made it possible to reduce the lookup time
on both the sender and the receiver thanks to the ingress/egress port
metadata.

https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
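
For context, this is roughly what ovn-controller sets up today: a full mesh
of point-to-point tunnel ports on br-int, one per remote chassis, with the
metadata carried in flow-based tunnel keys (chassis names and IPs invented):

```
# One Geneve port per remote chassis; "key=flow" lets the OpenFlow pipeline
# set the VNI and tunnel options per packet.
ovs-vsctl add-port br-int ovn-chassis-2 -- set interface ovn-chassis-2 \
    type=geneve options:remote_ip=192.0.2.2 options:key=flow
ovs-vsctl add-port br-int ovn-chassis-3 -- set interface ovn-chassis-3 \
    type=geneve options:remote_ip=192.0.2.3 options:key=flow
```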

If VXLAN + multicast and address learning were used, the "correct" tunnel
would be established ad hoc, and the lookups on both sender and receiver
would be reduced to simple mac forwarding with learning. The ingress
pipeline would probably cost a little more.

Maybe multicast + address learning could be implemented for GENEVE as
well. But it would not be interoperable with other VTEPs.

> > - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
> >   top-of-rack switches. Multicast is only used for BUM traffic.
> > - Only one VXLAN output port per implemented logical switch on a given
> >   chassis.
>
> Would this actually work with one VXLAN output port? Would you not need
> one port per target node to send unicast traffic (as you otherwise flood
> all packets to all participating nodes)?

You would need one VXLAN output port per implemented logical switch on
a given chassis. The port would have a VNI (unique per logical switch)
and an associated multicast IP address. Any chassis that implements this
logical switch would subscribe to that multicast group. Flooding would be
limited to the first packets and to broadcast/multicast traffic (ARP
requests, mostly). Once the receiver node replies, all communication would
happen over unicast.

https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
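
OVS cannot do this today (see the first con below), but the Linux kernel
vxlan driver illustrates the model: one device per VNI bound to a multicast
group, BUM traffic flooded to the group, and unicast forwarded to learned
VTEPs (interface name, VNI and group address are made up):

```
# One VXLAN device per logical switch: unique VNI plus a multicast group
# that every chassis implementing the switch joins.
ip link add vxlan-ls1 type vxlan id 4100 group 239.1.1.1 dev eth0 dstport 4789
ip link set vxlan-ls1 up
# Remote macs/VTEPs learned from incoming traffic:
bridge fdb show dev vxlan-ls1
```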

> > Cons:
> >
> > - OVS does not support VXLAN address learning yet.
> > - The number of usable multicast groups in a fabric network may be
> >   limited?
> > - How to manage seamless upgrades and interoperability with older OVN
> >   versions?
> - This pushes all logic related to chassis management to the
>   underlying networking fabric. It thereby places additional
>   requirements on the network fabric that did not exist before and that
>   might not be available to all users.

Are you aware of any fabric that does not support IGMP/MLD snooping?

> - The bfd sessions between chassis are no longer possible, thereby
>   preventing fast failover of gateway chassis.

I don't know what these BFD sessions are used for, but we could imagine
establishing them ad hoc when a tunnel is created.
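
If I understand correctly, these sessions are driven through the OVSDB
Interface table on each chassis, so an ad-hoc variant could presumably set
the same columns when a tunnel port is created (interface name invented):

```
# Enable BFD on a tunnel interface and query its state.
ovs-vsctl set interface ovn-chassis-2 bfd:enable=true
ovs-vsctl get interface ovn-chassis-2 bfd_status
```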

> As this idea requires VXLAN, and all of the current limitations would
> apply to this solution as well, it is probably not a general solution
> but rather a deployment option.

Yes, for backward compatibility, it would probably need to be opt-in.

> > Connect ovn-controller to the northbound DB
> > ===========================================
[snip]
> > For other components that require access to the southbound DB (e.g.
> > neutron metadata agent), ovn-controller should provide an interface to
> > expose state and configuration data for local consumption.
>
> Note that ovn-interconnect also uses access to the southbound DB to add
> chassis of the interconnected site (and potentially some more magic).

I was not aware of this. Thanks for the heads up.

> > Pros:
[snip]
>
> - one less codebase with northd gone
>
> > Cons:
> >
> > - This would be a serious API breakage for systems that depend on the
> >   southbound DB.
> > - Can all OVN constructs be implemented without a southbound DB?
> > - Is the community interested in alternative datapaths?
>
> - It requires each ovn-controller to do the translation of a given
>   construct (e.g. a logical switch) itself, thereby probably increasing
>   the CPU load and recompute time

We cannot get this for free :) The CPU load that is gone from the
central node needs to be shared across all chassis.

> - The complexity of the ovn-controller grows as it gains nearly all
>   logic of northd

Agreed, but the complexity may not be that high, since ovn-controller
would no longer need to do a two-stage translation from the NB model to
logical flows to OpenFlow.

Also, if we reuse OVS bridges to implement logical switches, there would
be fewer flows to compute.

> I now understand what you meant with the alternative datapaths in your
> first mail. While I find the option interesting, I'm not sure how much
> value would actually come out of it.

I have resource-constrained environments in mind (DPUs/IPUs, edge nodes,
etc.). When available memory and CPU are limited, OVS may not be the most
efficient option. Maybe using the plain Linux (or BSD) networking stack
would be perfectly suitable and more lightweight.

Also, since the northbound database doesn't know anything about flows,
this could make OVN interoperable with any network-capable element that
is able to implement the network intent described in the NB DB (<insert
the name of your vrouter here>, etc.).
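
As a sketch of what consuming that intent could look like, such an element
would only need to follow the northbound schema, e.g. by monitoring the
relevant tables over OVSDB (socket path and columns chosen for
illustration):

```
# Watch logical switches and their ports directly in the northbound DB;
# any datapath implementation could react to these updates.
ovsdb-client monitor unix:/var/run/ovn/ovnnb_db.sock OVN_Northbound \
    Logical_Switch name ports
```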

> For me it feels like this would make ovn significantly harder to debug.

Are you talking about ovn-trace? Or in general?

Thanks for your comments.
