On 6/10/25 2:03 PM, Felix Huettner via dev wrote:
> Previously the needed garps and rarps were calculated in each loop of
> the ovn-controller in pinctrl_run. This is quite wasteful as most of the
> time nothing relevant changes. In large external networks this can have
> a significant performance impact on the ovn-controller.
> 
> Previous patches 
> https://patchwork.ozlabs.org/project/ovn/patch/zzxvora5keucn...@sit-sdelap1634.int.lidl.net/
> extracted just a limited part of the functionality to an engine node.
> This patchset now limits the logic in pinctrl_run to the absolute
> minimum.
> Also it addresses some issues with incremental processing changes while
> the southbound connection is unavailable.
> 
> v11->v12:
>   * rebased
> v10->v11:
>   * removed merged patches
>   * fix concurrency bug in updating announcement times
> v9->v10:
>   * rebased
>   * fixed countdown in first patch
> v8->v9:
>   * rebased
>   * rework usage of cmap for garp_rarp data
> v7->v8:
>   * rebased
>   * improved handling of daemon_started_recently
> v6->v7:
>   * rebased
>   * fixed ct-zone inconsistencies that break CI
> v5->v6: rebased
> 
> Felix Huettner (1):
>   controller: Extract garp_rarp to engine node.


Hello there.  Unfortunately, our ovn-heater scale CI reported a significant
performance regression this week.  Further bisection points to this change
being the root cause.

What's happening is that, after this change, ovn-controller is stuck in a
constant loop of (forced) recomputes, which is very heavy at scale:

2025-06-22T16:14:02.292Z|02003|binding|INFO|Claiming lport lp-0-75 for this chassis.
2025-06-22T16:14:02.292Z|02004|binding|INFO|lp-0-75: Claiming 22:94:4c:ca:da:9b 16.0.0.76
2025-06-22T16:14:03.317Z|02005|inc_proc_eng|INFO|node: lb_data, recompute (forced) took 613ms
2025-06-22T16:14:08.939Z|02006|inc_proc_eng|INFO|node: lflow_output, recompute (forced) took 5622ms
2025-06-22T16:14:09.322Z|02007|binding|INFO|Setting lport lp-0-75 ovn-installed in OVS
2025-06-22T16:14:09.322Z|02008|binding|INFO|Setting lport lp-0-75 up in Southbound
2025-06-22T16:14:09.323Z|02009|timeval|WARN|Unreasonably long 6964ms poll interval (5657ms user, 1287ms system)
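
For anyone who wants to quantify this from their own ovn-controller logs,
here is a minimal sketch that sums the recompute time per engine node.  It
assumes the inc_proc_eng line format shown in the excerpt above; the helper
names (unwrap, summarize_recomputes) are made up for illustration and are
not part of OVN.  It also rejoins lines that a mail client wrapped.

```python
import re
from collections import defaultdict

# A log record starts with an ISO-like timestamp, as in the excerpt above.
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T")
# Assumed format: "...|inc_proc_eng|INFO|node: <name>, recompute (<kind>) took <N>ms"
RECOMPUTE_RE = re.compile(
    r"\|inc_proc_eng\|INFO\|node: (\w+), recompute\s+\((\w+)\)\s+took\s+(\d+)ms"
)


def unwrap(log_text):
    """Rejoin wrapped lines: anything not starting with a timestamp
    is treated as a continuation of the previous record."""
    records = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP_RE.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def summarize_recomputes(log_text):
    """Return {node: (recompute_count, total_ms)} for all recompute records."""
    totals = defaultdict(lambda: [0, 0])
    for record in unwrap(log_text):
        match = RECOMPUTE_RE.search(record)
        if match:
            node, _kind, ms = match.groups()
            totals[node][0] += 1
            totals[node][1] += int(ms)
    return {node: tuple(counts) for node, counts in totals.items()}


# Two records copied from the excerpt above, wrapped the way the mail shows them.
SAMPLE = """\
2025-06-22T16:14:03.317Z|02005|inc_proc_eng|INFO|node: lb_data, recompute
(forced) took 613ms
2025-06-22T16:14:08.939Z|02006|inc_proc_eng|INFO|node: lflow_output, recompute
(forced) took 5622ms
"""

summary = summarize_recomputes(SAMPLE)
for node, (count, total_ms) in summary.items():
    print(f"{node}: {count} recompute(s), {total_ms}ms total")
```

On the two sample records this accounts for 613ms + 5622ms = 6235ms of
recompute time within the 6964ms poll interval reported by the timeval
warning.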

Before this change, the cluster-density test, for example, didn't have any
long poll intervals and everything was computed incrementally.  After the
change it recomputes pretty much on every iteration and average ovn-installed
latency went up from 2.8 to 9.2 seconds.

Could someone, please, take a look?

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
