On Fri, Sep 29, 2023 at 7:26 AM Robin Jarry <rja...@redhat.com> wrote:
>
> Hi Felix,
>
> Thanks a lot for your message.
>
> Felix Huettner, Sep 29, 2023 at 14:35:
> > I can get that when running 10k ovn-controllers the benefits of
> > optimizing CPU and memory load are quite significant. However, I am
> > unsure about reducing the footprint of ovn-northd.
> > When running so many nodes I would have assumed that having an
> > additional (or maybe two) dedicated machines for ovn-northd would
> > be completely acceptable, as long as it can still actually do what
> > it should in a reasonable timeframe.
> > Would the goal for ovn-northd be more like "reduce the
> > full/incremental recompute time" then?
+1

> The main goal of this thread is to get a consensus on the actual
> issues that prevent scaling at the moment. We can discuss solutions in
> the other thread.

Thanks for the good discussions!

> > > * Allow support for alternative datapath implementations.
> >
> > Does this mean OVS datapaths (e.g. DPDK) or something different?
>
> See the other thread.

> > > Southbound Design
> > > =================
> ...
> > Note that ovn-controller also consumes the "state" of other chassis,
> > e.g. to build the tunnels to other chassis. To visualize my
> > understanding:
> >
> > +----------------+---------------+------------+
> > |                | configuration | state      |
> > +----------------+---------------+------------+
> > | ovn-northd     | write-only    | read-only  |
> > +----------------+---------------+------------+
> > | ovn-controller | read-only     | read-write |
> > +----------------+---------------+------------+
> > | some CMS       | no access?    | read-only  |
> > +----------------+---------------+------------+
>
> I think ovn-controller only consumes the logical flows. The chassis
> and port bindings tables are used by northd to update these logical
> flows.

Felix was right. For example, a port-binding is firstly a configuration
from the north-bound, but its state, such as its physical location (the
chassis column), is populated by the ovn-controller of the owning
chassis and consumed by the other ovn-controllers that are interested
in that port-binding.

> > > Centralized decisions
> > > =====================
> > >
> > > Every chassis needs to be "aware" of all other chassis in the
> > > cluster.
> >
> > I think we need to accept this as a fundamental truth, independent
> > of whether you look at centralized designs like OVN or the
> > neutron-l2 implementation, or at decentralized designs like BGP or
> > spanning tree. In all cases, if we need some kind of organized
> > communication, we need to know all relevant peers.
> > Designs might diverge in whether you need to be "aware" of all
> > peers or just some of them, but that is just a tradeoff between data
> > size and the options you have to forward data.
>
> > > This requirement mainly comes from overlay networks that are
> > > implemented over a full-mesh of point-to-point GENEVE tunnels (or
> > > VXLAN with some limitations). It is not a scaling issue by itself,
> > > but it implies a centralized decision which in turn puts pressure
> > > on the central node at scale.
> >
> > +1. On the other hand it removes signaling needs between the nodes
> > (like you would have with BGP).

Exactly, but was the signaling between the nodes ever an issue? I am
not an expert in BGP, but as far as I am aware, there are scaling
issues in things like BGP full-mesh signaling, and there are solutions
such as route reflectors (which are again centralized) to solve such
issues.

> > > Due to ovsdb monitoring and caching, any change in the southbound
> > > DB (either by northd or by any of the chassis controllers) is
> > > replicated on every chassis. The monitor_all option is often
> > > enabled on large clusters to avoid the conditional monitoring CPU
> > > cost on the central node.
> >
> > This is, I guess, something that should be possible to fix. We have
> > also enabled this setting as it gave us stability improvements, and
> > we do not yet see performance issues with it.
>
> So you have enabled monitor_all=true as well? Or did you test at
> scale with monitor_all=false?

We do use monitor_all=false, primarily to reduce the memory footprint
(and also the CPU cost of IDL processing) on each chassis. There are
trade-offs to the SB DB server performance:

- On one hand, it increases the cost of conditional monitoring, which
  is expensive for sure.
- On the other hand, it reduces the total amount of data for the
  server to propagate to clients.

It really depends on your topology for making the choice.
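For readers following along, monitor_all is a per-chassis setting that
ovn-controller reads from the local Open_vSwitch table. A minimal
sketch of toggling it (please double-check the key name against the
ovn-controller(8) man page for your version):

```shell
# Monitor all southbound DB contents on this chassis instead of using
# conditional monitoring (shifts cost from the SB server to the chassis):
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true

# Revert to conditional monitoring (the default):
ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=false
```

This is a configuration fragment that requires a running OVN chassis;
it only illustrates where the knob discussed above lives.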
If most of the nodes would anyway monitor most of the DB data
(something similar to a full mesh), it is more reasonable to use
monitor_all=true. Otherwise, in a topology like ovn-kubernetes where
each node has its dedicated part of the data, or in topologies where
you have lots of small "islands", such as a cloud with many small
tenants that never talk to each other, using monitor_all=false could
make sense (but it still needs to be carefully evaluated and tested for
your own use cases).

> What I am saying is that without monitor_all=true, the southbound
> ovsdb-server needs to do checks to determine what updates to send to
> which client. Since the server is single threaded, it becomes an
> issue at scale. I know that there were some significant improvements
> made recently but it will only push the limit further. I don't have
> hard data to prove my point yet, unfortunately.

> > > This leads to high memory usage on all chassis, control plane
> > > traffic and possible disruptions in the ovs-vswitchd datapath
> > > flow cache. Unfortunately, I don't have any hard data to back
> > > this claim. This is mainly coming from discussions I had with
> > > neutron contributors and from brainstorming sessions with
> > > colleagues.
> >
> > Could you maybe elaborate on the datapath flow cache issue, as it
> > sounds like it might affect actual live traffic and I am not aware
> > of the details there.
>
> I may have had a wrong understanding of the mechanisms of OVS here.
> I was under the impression that any update of the openflow rules
> would invalidate all datapath flows. It is far more subtle than this
> [1]. So unless there is an actual change in the packet pipeline, live
> traffic should not be affected.
>
> [1] https://developers.redhat.com/articles/2022/10/19/open-vswitch-revalidator-process-explained

In normal cases I don't think the monitor_all setting has an impact on
OVS flows, not to mention the flow cache.
ovn-controller has separate logic to determine what flows should be
programmed to OVS, regardless of the monitoring strategy. The extra SB
data is just useless in ovn-controller's memory. Well, there can still
be some corner cases where some useless data is blindly used and
programmed to OVS, and if you see such things we can definitely
optimize ovn-controller. Even so, as you also mentioned, it doesn't
necessarily mean the OVS megaflow cache would be invalidated.

> > The memory usage and the traffic would be fixed by not having to
> > rely on monitor_all, right?
>
> The memory usage would be reduced but I don't know to which point.
> One of the main consumers is the logical flows table which is
> required everywhere. Unless there is a way to only sync a portion of
> this table depending on the chassis, disabling monitor_all would save
> syncing the unneeded tables for ovn-controller: chassis, port
> bindings, etc.

Probably it wasn't what you meant, but I'd like to clarify that it is
not about unneeded tables, but unneeded rows in those tables (mainly
logical_flow and port_binding). It indeed syncs only a portion of the
tables. It does not depend directly on the chassis, but on what
port-bindings are on the chassis and what logical connectivity those
port-bindings have. So, again, the choice really depends on your use
cases.

> > > Dynamic mac learning
> > > ====================
> > >
> > > Logical switch ports on a given chassis are all connected to the
> > > same OVS bridge, in the same VLAN. This prevents using local mac
> > > address learning and shifts the responsibility to a centralized
> > > ovn-northd to create all the required logical flows to properly
> > > segment the network.
> > >
> > > When using mac_address=unknown ports, centralized mac learning is
> > > enabled and when a new address is seen entering a port, OVS sends
> > > it to the local controller which updates the FDB table and
> > > recomputes flow rules accordingly.
> > > With logical switches spanning across a large number of chassis,
> > > this centralized mac address learning and aging can have an
> > > impact on control plane and dataplane performance.
> >
> > Might one solution for that be to generate such lookup flows
> > without the help of northd, like it is already done for some other
> > tables? Northd could precreate the flow that is then templated for
> > each entry in the MAC_Bindings relevant to the respective router.
>
> Let's discuss solutions in the other thread.
>
> What do you think?

Thanks again for the thoughts and proposals. I will respond to the
other thread.

Thanks,
Han
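For reference, the mac_address=unknown behavior discussed in this
thread is configured per logical switch port. A minimal sketch, with a
hypothetical port name `lsp1` (the FDB inspection command assumes an
OVN version that has the southbound FDB table):

```shell
# Allow traffic from addresses not listed on the port; this is what
# enables centralized mac learning for lsp1.
ovn-nbctl lsp-set-addresses lsp1 "00:00:00:00:00:01 10.0.0.2" unknown

# Learned addresses can then be observed in the southbound FDB table:
ovn-sbctl list FDB
```

This is a configuration fragment that requires a running OVN
deployment; it only illustrates the mechanism being discussed.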
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss