On 6/10/25 2:03 PM, Felix Huettner via dev wrote:
> Previously the needed garps and rarps were calculated in each loop of
> the ovn-controller in pinctrl_run. This is quite wasteful as most of the
> time nothing relevant changes. In large external networks this can have
> a significant performance impact on the ovn-controller.
>
> Previous patches
> https://patchwork.ozlabs.org/project/ovn/patch/zzxvora5keucn...@sit-sdelap1634.int.lidl.net/
> extracted just a limited part of the functionality to an engine node.
> This patchset now limits the logic in pinctrl_run to the absolute
> minimum.
> Also it addresses some issues with incremental processing changes while
> the southbound connection is unavailable.
>
> v11->v12:
> * rebased
> v10->v11:
> * removed merged patches
> * fix concurrency bug in updating announcement times
> v9->v10:
> * rebased
> * fixed countdown in first patch
> v8->v9:
> * rebased
> * rework usage of cmap for garp_rarp data
> v7->v8:
> * rebased
> * improved handling of daemon_started_recently
> v6->v7:
> * rebased
> * fixed ct-zone inconsistencies that break ci
> v5->v6: rebased
>
> Felix Huettner (1):
> controller: Extract garp_rarp to engine node.
Hello there. Unfortunately, our ovn-heater scale CI reported a significant
performance regression this week. Further bisection points to this change
being the root cause.

After this change, ovn-controller is in a constant loop of (forced)
recomputes, which is very heavy at scale:

  2025-06-22T16:14:02.292Z|02003|binding|INFO|Claiming lport lp-0-75 for this chassis.
  2025-06-22T16:14:02.292Z|02004|binding|INFO|lp-0-75: Claiming 22:94:4c:ca:da:9b 16.0.0.76
  2025-06-22T16:14:03.317Z|02005|inc_proc_eng|INFO|node: lb_data, recompute (forced) took 613ms
  2025-06-22T16:14:08.939Z|02006|inc_proc_eng|INFO|node: lflow_output, recompute (forced) took 5622ms
  2025-06-22T16:14:09.322Z|02007|binding|INFO|Setting lport lp-0-75 ovn-installed in OVS
  2025-06-22T16:14:09.322Z|02008|binding|INFO|Setting lport lp-0-75 up in Southbound
  2025-06-22T16:14:09.323Z|02009|timeval|WARN|Unreasonably long 6964ms poll interval (5657ms user, 1287ms system)

Before this change, the cluster-density test, for example, didn't have any
long poll intervals and everything was computed incrementally. After the
change, it recomputes on pretty much every iteration, and the average
ovn-installed latency went up from 2.8 to 9.2 seconds.

Could someone please take a look?

Best regards, Ilya Maximets.
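For anyone trying to reproduce this, a minimal sketch of how the forced recomputes could be tallied from an ovn-controller log. It assumes the log line format shown in the excerpt above; the regular expressions and the `summarize` helper are illustrative and not part of OVN or ovn-heater.

```python
import re
from collections import defaultdict

# Assumed format, taken from the excerpt in this thread:
#   ...|inc_proc_eng|INFO|node: lflow_output, recompute (forced) took 5622ms
RECOMPUTE_RE = re.compile(
    r"\|inc_proc_eng\|INFO\|node: (?P<node>\w+), "
    r"recompute \(forced\) took (?P<ms>\d+)ms"
)
# Assumed format of the long-poll warning:
#   ...|timeval|WARN|Unreasonably long 6964ms poll interval (...)
POLL_RE = re.compile(r"Unreasonably long (?P<ms>\d+)ms poll interval")


def summarize(lines):
    """Return (forced recompute ms per engine node, list of long poll ms)."""
    per_node = defaultdict(int)
    long_polls = []
    for line in lines:
        match = RECOMPUTE_RE.search(line)
        if match:
            per_node[match.group("node")] += int(match.group("ms"))
            continue
        match = POLL_RE.search(line)
        if match:
            long_polls.append(int(match.group("ms")))
    return dict(per_node), long_polls


# The three relevant lines from the excerpt in this thread.
SAMPLE = [
    "2025-06-22T16:14:03.317Z|02005|inc_proc_eng|INFO|"
    "node: lb_data, recompute (forced) took 613ms",
    "2025-06-22T16:14:08.939Z|02006|inc_proc_eng|INFO|"
    "node: lflow_output, recompute (forced) took 5622ms",
    "2025-06-22T16:14:09.323Z|02009|timeval|WARN|"
    "Unreasonably long 6964ms poll interval (5657ms user, 1287ms system)",
]

if __name__ == "__main__":
    per_node, long_polls = summarize(SAMPLE)
    for node, total_ms in sorted(per_node.items()):
        print(f"{node}: {total_ms}ms in forced recomputes")
    print(f"long poll intervals (ms): {long_polls}")
```

On the excerpt, the two forced recomputes add up to 6235ms, against a reported poll interval of 6964ms. Running the same tally over a full log before and after the change should show whether the forced recomputes account for the latency difference.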