Hi Vladislav, Frode,

Thanks for your replies.

Frode Nordahl, Sep 30, 2023 at 10:55:
> On Sat, Sep 30, 2023 at 9:43 AM Vladislav Odintsov via discuss
> <ovs-discuss@openvswitch.org> wrote:
> > > On 29 Sep 2023, at 18:14, Robin Jarry via discuss 
> > > <ovs-discuss@openvswitch.org> wrote:
> > >
> > > Felix Huettner, Sep 29, 2023 at 15:23:
> > >>> Distributed mac learning
> > >>> ========================
> > > [snip]
> > >>>
> > >>> Cons:
> > >>>
> > >>> - How to manage seamless upgrades?
> > >>> - Requires ovn-controller to move/plug ports in the correct bridge.
> > >>> - Multiple openflow connections (one per managed bridge).
> > >>> - Requires ovn-trace to be reimplemented differently (maybe other tools
> > >>>  as well).
> > >>
> > >> - No central information anymore on mac bindings. All nodes need to
> > >>  update their data individually
> > >> - Each bridge also creates a Linux network interface. I do not know if
> > >>  there is some kind of limit on the number of Linux interfaces or OVS
> > >>  bridges somewhere.
> > >
> > > That's a good point. However, only the bridges for the logical
> > > networks implemented on a given chassis would need to be created
> > > there. Even in the largest OVN deployments, I doubt this would be
> > > a limitation.
> > >
> > >> Would you still preprovision static mac addresses on the bridge for all
> > >> port_bindings we know the mac address from, or would you rather leave
> > >> that up for learning as well?
> > >
> > > I would leave everything dynamic.
> > >
> > >> I do not know if there is some kind of performance/optimization penalty
> > >> for moving packets between different bridges.
> > >
> > > As far as I know, once the openflow pipeline has been resolved into
> > > a datapath flow, there is no penalty.
> > >
> > >> You also cannot use only the logical switches that have a local port
> > >> bound. Assume the following topology:
> > >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > >> |vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
> > >> +---+ +---+ +---+ +---+ +---+ +---+ +---+
> > >> vm1 and vm2 are both running on the same hypervisor. Creating only local
> > >> logical switches would mean only ls1 and ls3 are available on that
> > >> hypervisor. This would break the connection between the two vms which
> > >> would in the current implementation just traverse the two logical
> > >> routers.
> > >> I guess we would need to create bridges for each locally reachable
> > >> logical switch. I am concerned about the potentially significant
> > >> increase in bridges and openflow connections this brings.
> > >
> > > That is one of the concerns I raised in the last point. In my opinion
> > > this is a trade off. You remove centralization and require more local
> > > processing. But overall, the processing cost should remain equivalent.
> >
> > Just want to clarify.
> >
> > For topology described by Felix above, you propose to create 2 OVS
> > bridges, right? How will the packet traverse from vm1 to vm2?

In this particular case, there would be 3 OVS bridges, one for each
logical switch.

> > Currently when the packet enters OVS all the logical switching and
> > routing openflow calculation is done with no packet re-entering OVS,
> > and this results in one DP flow match to deliver this packet from
> > vm1 to vm2 (if no conntrack used, which could introduce
> > recirculations).
> >
> > Do I understand correctly, that in this proposal OVS needs to
> > receive packet from “ls1” bridge, next run through lrouter “lr1”
> > OpenFlow pipelines, then output packet to “ls2” OVS bridge for mac
> > learning between logical routers (should we have here OF flow with
> > learn action?), then send packet again to OVS, calculate “lr2”
> > OpenFlow pipeline and finally reach destination OVS bridge “ls3” to
> > send packet to a vm2?

What I am proposing is to implement the northbound L2 network intent
with actual OVS bridges and built-in OVS MAC learning. The L3/L4 network
constructs and ACLs would require patch ports and specific OF pipelines.

We could even think of adding more advanced L3 capabilities (RIB) into
OVS to simplify the OF pipelines.
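
To make this more concrete, here is a rough, untested sketch of what
could be wired up for Felix's vm1-ls1-lr1-ls2-lr2-ls3-vm2 topology
above. Bridge and port names are purely illustrative; the routers are
modelled as patch port pairs carrying the L3/ACL OF pipelines:

```
# One OVS bridge per locally implemented logical switch, relying on the
# built-in MAC learning of the NORMAL action.
ovs-vsctl add-br br-ls1
ovs-vsctl add-br br-ls2
ovs-vsctl add-br br-ls3

# lr1 modelled as a patch port pair between br-ls1 and br-ls2; lr2 would
# be wired the same way between br-ls2 and br-ls3. The routing/ACL logic
# would live in the OF pipelines attached to these ports.
ovs-vsctl add-port br-ls1 lr1-ls1 \
    -- set interface lr1-ls1 type=patch options:peer=lr1-ls2
ovs-vsctl add-port br-ls2 lr1-ls2 \
    -- set interface lr1-ls2 type=patch options:peer=lr1-ls1
```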

> > Also, will such behavior be compatible with HW offload to
> > smartnics/DPUs?
>
> I am also a bit concerned about this, what would be the typical number
> of bridges supported by hardware?

As far as I understand, only the datapath flows are offloaded to
hardware. The OF pipeline is only consulted on the upcall for the first
packet of a flow. Once resolved, the datapath flow is reused. OVS
bridges are only logical constructs; they are reflected neither in the
datapath nor in hardware.
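
For illustration, this is how one could compare the OF pipeline with
what actually runs (and gets offloaded) in the datapath; the bridge name
is hypothetical and this assumes the usual dpctl commands:

```
# OpenFlow pipeline: only consulted by ovs-vswitchd on upcalls.
ovs-ofctl dump-flows br-ls1

# Megaflows installed in the datapath after the first packet.
ovs-appctl dpctl/dump-flows

# Subset of datapath flows that were offloaded to hardware.
ovs-appctl dpctl/dump-flows type=offloaded
```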

> > >>> Use multicast for overlay networks
> > >>> ==================================
> > > [snip]
> > >>> - 24bit VNI allows for more than 16 million logical switches. No need
> > >>>  for extended GENEVE tunnel options.
> > >> Note that using vxlan at the moment significantly reduces the ovn
> > >> featureset. This is because the geneve header options are currently used
> > >> for data that would not fit into the vxlan vni.
> > >>
> > >> From ovn-architecture.7.xml:
> > >> ```
> > >> The maximum number of networks is reduced to 4096.
> > >> The maximum number of ports per network is reduced to 2048.
> > >> ACLs matching against logical ingress port identifiers are not supported.
> > >> OVN interconnection feature is not supported.
> > >> ```
> > >
> > > In my understanding, the main reason why GENEVE replaced VXLAN is
> > > that OpenStack uses full-mesh point-to-point tunnels and the sender
> > > needs to know behind which chassis each mac address sits to send it
> > > into the correct tunnel. GENEVE allowed reducing the lookup time both
> > > on the sender and the receiver thanks to ingress/egress port metadata.
> > >
> > > https://blog.russellbryant.net/2017/05/30/ovn-geneve-vs-vxlan-does-it-matter/
> > > https://dani.foroselectronica.es/ovn-geneve-encapsulation-541/
> > >
> > > If VXLAN + multicast and address learning was used, the "correct" tunnel
> > > would be established ad-hoc and both sender and receiver lookups would
> > > only be a simple mac forwarding with learning. The ingress pipeline
> > > would probably cost a little more.
> > >
> > > Maybe multicast + address learning could be implemented for GENEVE as
> > > well. But it would not be interoperable with other VTEPs.
>
> While it is true that it takes time before switch hardware picks up
> support for emerging protocols, I do not think it is a valid argument
> for limiting the development of OVN. Most hardware offload capable
> NICs already have GENEVE support, and if you survey recent or upcoming
> releases from top of rack switch vendors you will also find that they
> have added support for using GENEVE for hardware VTEPs. The fact that
> SDNs with a large customer footprint (such as NSX and OVN) make use of
> GENEVE is most likely a deciding factor for their adoption, and I see
> no reason why we should stop defining the edge of development in this
> space.

GENEVE would be perfectly suitable with a multicast-based control plane
to establish ad-hoc tunnels without any centralized involvement.

I was only proposing VXLAN since this multicast group system was part of
the original RFC (supported in Linux since 3.12).
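
For reference, this is the flood-and-learn model I have in mind, shown
with the kernel's native VXLAN device (names and addresses are made up).
As listed in the cons, OVS itself cannot do this today:

```
# VXLAN device bound to a multicast group: BUM traffic is sent to the
# group, unicast goes directly to the learned remote VTEP.
ip link add vxlan-ls1 type vxlan id 100 group 239.1.1.100 dev eth0 dstport 4789
ip link set vxlan-ls1 up

# Remote MAC/VTEP associations learned from received traffic.
bridge fdb show dev vxlan-ls1
```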

> > >>> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
> > >>>  top-of-rack switches. Multicast is only used for BUM traffic.
> > >>> - Only one VXLAN output port per implemented logical switch on a given
> > >>>  chassis.
> > >>
> > >> Would this actually work with one VXLAN output port? Would you not need
> > >> one port per target node to send unicast traffic (as you otherwise flood
> > >> all packets to all participating nodes)?
> > >
> > > You would need one VXLAN output port per implemented logical switch on
> > > a given chassis. The port would have a VNI (unique per logical switch)
> > > and an associated multicast IP address. Any chassis that implements this
> > > logical switch would subscribe to that multicast group. The flooding
> > > would be limited to first packets and broadcast/multicast traffic (ARP
> > > requests, mostly). Once the receiver node replies, all communication
> > > will happen with unicast.
> > >
> > > https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanaddresslearning/#BUM_Traffic
> > >
> > >>> Cons:
> > >>>
> > >>> - OVS does not support VXLAN address learning yet.
> > >>> - The number of usable multicast groups in a fabric network may be
> > >>>  limited?
> > >>> - How to manage seamless upgrades and interoperability with older OVN
> > >>>  versions?
> > >> - This pushes all logic related to chassis management to the
> > >>  underlying networking fabric. It thereby places additional
> > >>  requirements on the network fabric that have not been here before and
> > >>  that might not be available for all users.
> > >
> > > Are you aware of any fabric that does not support IGMP/MLD snooping?
>
> Have you ever operated a network without having issues with multicast? ;)

I must admit I don't have enough field experience operating large
network fabrics to state what issues multicast can cause with these.
This is why I raised this in the cons list :)

What specific issues did you have in mind?

> > >> - The bfd sessions between chassis are no longer possible thereby
> > >>  preventing fast failover of gateway chassis.
> > >
> > > I don't know what these BFD sessions are used for. But we could imagine
> > > an ad-hoc establishment of them when a tunnel is created.
> > >
> > >> As this idea requires VXLAN, and all current limitations would apply to
> > >> this solution as well, it is probably not a general solution but rather
> > >> a deployment option.
> > >
> > > Yes, for backward compatibility, it would probably need to be opt-in.
>
> Would an alternative be to look at how we can make the existing
> communication infrastructure that OVN provides between the
> ovn-controllers more efficient for this use case? If you think about
> it, could it be used for "multicast" like operation? One of the issues
> with large L2s for OVN today is the population of every known mac
> address in the network to every chassis in the cloud. Would an
> alternative be to:
>
> - Each ovn-controller preprograms only the mac bindings for logical
>   switch ports residing on the hypervisor.
> - When learning of a remote MAC address is necessary, broadcast the
>   request only to tunnel endpoints where we know there are logical
>   switch ports for the same logical switch.
> - Add a local OVSDB instance for ovn-controller to store things such
>   as learned mac addresses instead of using the central DB for this
>   information.

I am afraid that this would complicate OVN even further. Why
reimplement existing network stack features in OVN?

I am eager to know whether real multicast operation was ever
considered and, if so, why it was discarded as a viable option. If not,
could we consider it?

> > >>> Connect ovn-controller to the northbound DB
> > >>> ===========================================
> > > [snip]
> > >>> For other components that require access to the southbound DB (e.g.
> > >>> neutron metadata agent), ovn-controller should provide an interface to
> > >>> expose state and configuration data for local consumption.
> > >>
> > >> Note that ovn-interconnect also uses access to the southbound DB to add
> > >> chassis of the interconnected site (and potentially some more magic).
> > >
> > > I was not aware of this. Thanks for the heads up.
> > >
> > >>> Pros:
> > > [snip]
> > >>
> > >> - one less codebase with northd gone
> > >>
> > >>> Cons:
> > >>>
> > >>> - This would be a serious API breakage for systems that depend on the
> > >>>  southbound DB.
> > >>> - Can all OVN constructs be implemented without a southbound DB?
> > >>> - Is the community interested in alternative datapaths?
> > >>
> > >> - It requires each ovn-controller to do that translation of a given
> > >>  construct (e.g. a logical switch) thereby probably increasing the cpu
> > >>  load and recompute time
> > >
> > > We cannot get this for free :) The CPU load that is gone from the
> > > central node needs to be shared across all chassis.
> > >
> > >> - The complexity of the ovn-controller grows as it gains nearly all
> > >>  logic of northd
> > >
> > > Agreed, but the complexity may not be that high, since ovn-controller
> > > would not need to do a two-staged translation from NB model to logical
> > > flows to OpenFlow.
> > >
> > > Also, if we reuse OVS bridges to implement logical switches, there
> > > would be fewer flows to compute.
> > >
> > >> I now understand what you meant by the alternative datapaths in your
> > >> first mail. While I find the option interesting, I'm not sure how much
> > >> value would actually come out of it.
> > >
> > > I have resource-constrained environments in mind (DPUs/IPUs, edge
> > > nodes, etc.). When available memory and CPU are limited, OVS may not be
> > > the most efficient option. Maybe using the plain Linux (or BSD)
> > > networking stack would be perfectly suitable and more lightweight.
>
> I honestly do not think using linuxbridge as a datapath is a
> desirable option, for multiple reasons:
>
> 1) There is no performant and hardware offloadable way to express ACLs
>    for linuxbridges.
> 2) There is no way to express L3 constructs for linuxbridges.
> 3) The current OVS OpenFlow bridge model is a perfect fit for
>    translating the intent into flows programmed directly into the
>    hardware switch on the NIC, and from my perspective this is one of
>    the main reasons why we are migrating the world onto OVS/OVN and
>    away from legacy implementations based on linuxbridges and network
>    namespaces.
>
> Accelerator cards/DPUs/IPUs are usually equipped with such hardware
> switches (implemented in ASIC or FPGA).

Let me first clarify one point: I am *not* suggesting using Linux
bridges and network namespaces as a first-class replacement for OVS.
I am aware that the Linux network stack has neither the performance nor
the determinism required for cloud and telco use cases.

What I am proposing is to make OVN more inclusive and decouple it from
the flow-based paradigm by allowing alternative implementations of the
northbound network intent.
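
The northbound schema is already consumable by anything that speaks
OVSDB, so an alternative implementation would not need a new interface
to read the intent. A trivial example (the endpoint address is made up,
6641 being the usual NB DB port):

```
# Dump the L2/L3 intent straight from the northbound DB.
ovn-nbctl --db=tcp:192.0.2.10:6641 list Logical_Switch
ovn-nbctl --db=tcp:192.0.2.10:6641 list Logical_Router

# Or monitor it for changes, much like ovn-controller does with the SB DB.
ovsdb-client monitor tcp:192.0.2.10:6641 OVN_Northbound Logical_Switch name ports
```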

I realize that this idea may be controversial for the community since
OVN has been closely tied to OVS since the start. However, I am
convinced that this is a direction worth exploring, or at least
discussing :)

> > > Also, since the northbound database doesn't know anything about flows,
> > > it could make OVN interoperable with any network capable element that is
> > > able to implement the network intent as described in the NB DB (<insert
> > > the name of your vrouter here>, etc.).
> > >
> > >> For me it feels like this would make OVN significantly harder to debug.
> > >
> > > Are you talking about ovn-trace? Or in general?
> > >
> > > Thanks for your comments.

Thanks everyone for the constructive discussion so far!
