On Fri, Sep 29, 2023 at 7:26 AM Robin Jarry <rja...@redhat.com> wrote:
>
> Hi Felix,
>
> Thanks a lot for your message.
>
> Felix Huettner, Sep 29, 2023 at 14:35:
> > I can get that when running 10k ovn-controllers the benefits of
> > optimizing cpu and memory load are quite significant. However I am
> > unsure about reducing the footprint of ovn-northd.
> > When running so many nodes I would have assumed that having an
> > additional (or maybe two) dedicated machines for ovn-northd would
> > be completely acceptable, as long as it can still actually do what
> > it should in a reasonable timeframe.
> > Would the goal for ovn-northd be more like "Reduce the full/incremental
> > recompute time" then?

+1

>
> The main goal of this thread is to get a consensus on the actual issues
> that prevent scaling at the moment. We can discuss solutions in the
> other thread.
>

Thanks for the good discussions!

> > > * Allow support for alternative datapath implementations.
> >
> > Does this mean ovs datapaths (e.g. dpdk) or something different?
>
> See the other thread.
>
> > > Southbound Design
> > > =================
> ...
> > Note that ovn-controller also consumes the "state" of other chassis,
> > e.g. to build the tunnels to other chassis. To visualize my understanding:
> >
> > +----------------+---------------+------------+
> > |                | configuration |   state    |
> > +----------------+---------------+------------+
> > |   ovn-northd   |  write-only   | read-only  |
> > +----------------+---------------+------------+
> > | ovn-controller |   read-only   | read-write |
> > +----------------+---------------+------------+
> > |    some cms    |  no access?   | read-only  |
> > +----------------+---------------+------------+
>
> I think ovn-controller only consumes the logical flows. The chassis and
> port bindings tables are used by northd to update these logical flows.
>

Felix was right. For example, a port binding is first of all configuration
coming from the northbound side, but its state, such as the physical
location (the chassis column), is populated by the ovn-controller of the
owning chassis and consumed by other ovn-controllers that are interested in
that port binding.
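To make the configuration/state split concrete, here is a rough, read-only
sketch using the python-ovs IDL. The SB endpoint, schema path and polling
loop are illustrative assumptions (this is not how ovn-controller is
implemented); it only shows a peer reading the chassis column of
Port_Binding, plus the referenced Chassis/Encap rows, to learn which tunnel
endpoint reaches a remote port:

    import ovs.db.idl
    import ovs.poller

    SB_REMOTE = "tcp:127.0.0.1:6642"               # assumed endpoint
    SB_SCHEMA = "/usr/share/ovn/ovn-sb.ovsschema"  # assumed schema path

    helper = ovs.db.idl.SchemaHelper(SB_SCHEMA)
    helper.register_columns("Port_Binding", ["logical_port", "chassis"])
    helper.register_columns("Chassis", ["name", "encaps"])
    helper.register_columns("Encap", ["type", "ip"])
    idl = ovs.db.idl.Idl(SB_REMOTE, helper)

    # Wait for the initial monitor reply to populate the local cache.
    while idl.change_seqno == 0:
        idl.run()
        poller = ovs.poller.Poller()
        idl.wait(poller)
        poller.block()

    # Port_Binding rows are configuration written by northd; the chassis
    # column is state written by the ovn-controller that owns the port.
    for pb in idl.tables["Port_Binding"].rows.values():
        for ch in pb.chassis:  # empty until the port is bound somewhere
            print(pb.logical_port, "->", ch.name,
                  [(e.type, e.ip) for e in ch.encaps])

    idl.close()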

> > > Centralized decisions
> > > =====================
> > >
> > > Every chassis needs to be "aware" of all other chassis in the cluster.
> >
> > I think we need to accept this as a fundamental truth, independent of
> > whether you look at centralized designs like ovn or the neutron-l2
> > implementation, or at decentralized designs like bgp or spanning tree.
> > In all cases, if we need some kind of organized communication, we need
> > to know all relevant peers.
> > Designs might diverge on whether you need to be "aware" of all peers or
> > just some of them, but that is just a tradeoff between data size and the
> > options you have to forward data.
> >
> > > This requirement mainly comes from overlay networks that are implemented
> > > over a full-mesh of point-to-point GENEVE tunnels (or VXLAN with some
> > > limitations). It is not a scaling issue by itself, but it implies
> > > a centralized decision which in turn puts pressure on the central node
> > > at scale.
> >
> > +1. On the other hand it removes signaling needs between the nodes (like
> > you would have with bgp).
>
> Exactly, but was the signaling between the nodes ever an issue?

I am not an expert on BGP, but as far as I am aware there are scaling
issues in things like BGP full-mesh signaling, and there are solutions such
as route reflectors (which are, again, centralized) to solve such issues.

>
> > > Due to ovsdb monitoring and caching, any change in the southbound DB
> > > (either by northd or by any of the chassis controllers) is replicated on
> > > every chassis. The monitor_all option is often enabled on large clusters
> > > to avoid the conditional monitoring CPU cost on the central node.
> >
> > This is, I guess, something that should be possible to fix. We have also
> > enabled this setting as it gave us stability improvements and we do not
> > yet see performance issues with it.
>
> So you have enabled monitor_all=true as well? Or did you test at scale
> with monitor_all=false?
>
We do use monitor_all=false, primarily to reduce the memory footprint (and
also the CPU cost of IDL processing) on each chassis. There are trade-offs
for SB DB server performance:
- On one hand, it increases the cost of conditional monitoring, which is
expensive for sure.
- On the other hand, it reduces the total amount of data the server has to
propagate to clients.

The choice really depends on your topology. If most of the nodes would
monitor most of the DB data anyway (something similar to a full mesh), it
is more reasonable to use monitor_all=true. Otherwise, in a topology like
ovn-kubernetes where each node has its own dedicated part of the data, or
in topologies with lots of small "islands", such as a cloud with many small
tenants that never talk to each other, monitor_all=false can make sense
(but it still needs to be carefully evaluated and tested for your own use
cases).
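To make the trade-off a bit more concrete, here is a rough sketch of what
the client side of conditional monitoring looks like, again using the
python-ovs IDL and assuming its cond_change() call. The endpoint, schema
path and port names are made up for illustration; ovn-controller's real
logic derives its conditions from the locally bound ports and their logical
connectivity rather than from a hard-coded list:

    import ovs.db.idl
    import ovs.poller

    SB_REMOTE = "tcp:127.0.0.1:6642"               # assumed endpoint
    SB_SCHEMA = "/usr/share/ovn/ovn-sb.ovsschema"  # assumed schema path

    helper = ovs.db.idl.SchemaHelper(SB_SCHEMA)
    helper.register_columns("Port_Binding", ["logical_port", "chassis"])
    idl = ovs.db.idl.Idl(SB_REMOTE, helper)

    # With monitor_all=true there is no condition: the server streams every
    # Port_Binding row to every client. With monitor_all=false the client
    # installs a condition and the server filters rows before sending them,
    # trading server-side CPU for less data and smaller client caches.
    local_ports = ["lp-vm1", "lp-vm2"]             # illustrative names
    cond = [["logical_port", "==", p] for p in local_ports]
    idl.cond_change("Port_Binding", cond)

    while True:
        idl.run()   # only rows matching the condition are replicated here
        for pb in idl.tables["Port_Binding"].rows.values():
            print(pb.logical_port)
        poller = ovs.poller.Poller()
        idl.wait(poller)
        poller.block()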

> What I am saying is that without monitor_all=true, the southbound
> ovsdb-server needs to do checks to determine what updates to send to
> which client. Since the server is single threaded, it becomes an issue
> at scale. I know that there were some significant improvements made
> recently but it will only push the limit further. I don't have hard data
> to prove my point yet unfortunately.
>
> > > This leads to high memory usage on all chassis, control plane traffic
> > > and possible disruptions in the ovs-vswitchd datapath flow cache.
> > > Unfortunately, I don't have any hard data to back this claim. This is
> > > mainly coming from discussions I had with neutron contributors and from
> > > brainstorming sessions with colleagues.
> >
> > Could you maybe elaborate on the datapath flow cache issue, as it sounds
> > like it might affect actual live traffic and I am not aware of the
> > details there.
>
> I may have had a wrong understanding of the mechanisms of OVS here.
> I was under the impression that any update of the openflow rules would
> invalidate all datapath flows. It is far more subtle than this [1].
> So unless there is an actual change in the packet pipeline, live traffic
> should not be affected.
>
> [1] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained
>

In normal cases I don't think the monitor_all setting has an impact on OVS
flows, not to mention the flow cache.
ovn-controller has separate logic to determine what flows should be
programmed to OVS, regardless of the monitoring strategy. The extra SB data
just sits unused in ovn-controller's memory.
Well, there can still be some corner cases where such unneeded data is
blindly used and programmed to OVS, and if you see such things we can
definitely optimize ovn-controller. Even then, as you also mentioned, that
doesn't necessarily mean the OVS megaflow cache would be invalidated.

> > The memory usage and the traffic would be fixed by not having to rely on
> > monitor_all, right?
>
> The memory usage would be reduced but I don't know to what extent. One
> of the main consumers is the logical flows table which is required
> everywhere. Unless there is a way to only sync a portion of this table
> depending on the chassis, disabling monitor_all would save syncing the
> unneeded tables for ovn-controller: chassis, port bindings, etc.

Probably this wasn't what you meant, but I'd like to clarify that it is not
about unneeded tables, but about unneeded rows in those tables (mainly
logical_flow and port_binding).
It does indeed sync only a portion of those tables. The selection does not
depend directly on the chassis, but on which port bindings are on the
chassis and what logical connectivity those port bindings have. So, again,
the choice really depends on your use cases.
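To illustrate "a portion of the tables" (purely conceptual, not the exact
conditions ovn-controller generates), row-level monitor conditions in the
OVSDB monitor_cond sense could look roughly like the sketch below. The
datapath UUIDs are made up; the point is that only rows are filtered, while
the Port_Binding and Logical_Flow tables themselves stay monitored:

    # Hypothetical datapath UUIDs derived from the chassis' local port
    # bindings; purely illustrative values.
    local_datapaths = [
        "11111111-2222-3333-4444-555555555555",
        "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    ]

    # JSON-style monitor_cond clauses: each clause is [column, op, value].
    # A row is replicated when one of the clauses matches it, so the client
    # receives only the rows tied to its local datapaths.
    port_binding_cond = [
        ["datapath", "==", ["uuid", dp]] for dp in local_datapaths
    ]
    logical_flow_cond = [
        ["logical_datapath", "==", ["uuid", dp]] for dp in local_datapaths
    ]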

>
> > > Dynamic mac learning
> > > ====================
> > >
> > > Logical switch ports on a given chassis are all connected to the same
> > > OVS bridge, in the same VLAN. This prevents using local mac address
> > > learning and shifts the responsibility to a centralized ovn-northd to
> > > create all the required logical flows to properly segment the network.
> > >
> > > When using mac_address=unknown ports, centralized mac learning is
> > > enabled and when a new address is seen entering a port, OVS sends it to
> > > the local controller which updates the FDB table and recomputes flow
> > > rules accordingly. With logical switches spanning across a large number
> > > of chassis, this centralized mac address learning and aging can have an
> > > impact on control plane and dataplane performance.
> >
> > Might one solution for that be to generate such lookup flows without
> > help from northd, as is already done for some other tables? Northd could
> > pre-create the flow that is then templated for each entry in MAC_Bindings
> > relevant to the respective router.
>
> Let's discuss solutions in the other thread.
>
> What do you think?
>
Thanks again for the thoughts and proposals. I will respond in the other
thread.

Thanks,
Han
