On 9/18/23 11:34, Brendan Doyle via discuss wrote:
> Hi Folks,
>
> So we have run into an issue when using an Address Set (AS) containing a
> number of CIDRs in an ACL. Initially we observed out-of-memory (OOM) kills
> of ovs-vswitchd/ovn-controller on the ovn-controller chassis nodes. We then
> upgraded to the latest OVN/OVS LTS, which fixed the OOM kills, but we now
> see ovs-vswitchd/ovn-controller hit 100% CPU on those nodes. Whether it is
> ovs-vswitchd or ovn-controller seems to depend on the size and number of
> CIDRs in the AS. I have narrowed the issue down to be 100% reproducible
> with this very basic OVN configuration, which just has a gateway with a
> distributed router port:
>
>
> ovn-nbctl show
> switch 5f666cde-217c-487d-9470-86e6cb197f39 (ls_vcn2_external_gw)
>     port ls_vcn2_external_gw-lr_vcn2_gw
>         type: router
>         router-port: lr_vcn2_gw-ls_vcn2_external_gw
>     port ln-ls_vcn2_external_gw
>         type: localnet
>         addresses: ["unknown"]
> router 484cbb41-bc6b-4be3-b79e-c05696146786 (lr_vcn2_gw)
>     port lr_vcn2_gw-ls_vcn2_external_gw
>         mac: "40:44:00:00:01:80"
>         networks: ["253.255.80.6/16"]
>         gateway chassis: [pcacn003 pcacn001 pcacn002]
>
> It has the following ACL on the external gateway switch:
>
> ovn-nbctl acl-list ls_vcn2_external_gw
> from-lport 32700 (inport == "ls_vcn2_external_gw-lr_vcn2_gw" && (ip4.dst == 10.80.179.0/28 && ip4.src != $vcn2_as_10_80_179_0_28)) drop log(name=vcn2_as_10_80_179_0_28_gw)
>
> The AS has the following CIDRs:
>
> ovn-nbctl list Address_Set vcn2_as_10_80_179_0_28 | grep addre
> addresses : ["192.17.1.0/28", "192.17.1.16/28",
> "192.17.1.32/28", "192.17.1.48/28", "192.17.1.64/28", "192.17.1.80/28",
> "192.17.1.96/28"]
>
> If I restrict the AS to just 4 CIDRs, everything is OK, but with 7 or
> more, ovs-vswitchd reaches 100% CPU on all ovn-controller chassis and
> stays there. If I change the size of the CIDRs to /24, then
> ovn-controller hits 100% CPU with 5 or more CIDRs in the AS.
>
> Is this a known issue, or what could be happening here?

Hi, Brendan. This is a known issue caused by having many different
networks in a negative match, which leads to an explosion in the
number of OpenFlow rules.
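
One rough way to observe the explosion on a chassis is to count the OpenFlow flows before and after adding CIDRs to the address set. This is only a sketch: it assumes the integration bridge uses the default name br-int, and depending on the protocols enabled on the bridge an explicit -O OpenFlow13 may be needed.

```shell
# Count the OpenFlow flows currently installed on the integration bridge.
# Assumption: the bridge is named br-int (the OVN default).
ovs-ofctl dump-flows br-int | wc -l

# Alternatively, ask ovs-vswitchd for aggregate statistics (includes flow_count).
ovs-ofctl dump-aggregate br-int
```

Comparing the numbers with 4 and with 7 CIDRs in the address set should show whether the flow count is what grows.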

The solution is to reduce the number of negative ('!=') matches in
your expression or, if possible, to turn the ACL into multiple ACLs
with positive matches instead, e.g. by using an explicit 'allow' ACL
for these CIDRs and a lower-priority catch-all 'drop'.
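
For the ACL quoted above, the positive-match form could look roughly like the following. This is a sketch, not a tested configuration: the priority 32600 for the catch-all is an arbitrary illustrative value, and the switch, port, address set and log names are taken from the original post.

```shell
# Allow traffic to the subnet when the source IS in the address set
# (positive match, same priority as the original ACL).
ovn-nbctl acl-add ls_vcn2_external_gw from-lport 32700 \
    'inport == "ls_vcn2_external_gw-lr_vcn2_gw" && ip4.dst == 10.80.179.0/28 && ip4.src == $vcn2_as_10_80_179_0_28' \
    allow

# Lower-priority catch-all: drop (and log) everything else sent to the subnet.
# Assumption: 32600 is a hypothetical priority; any value below 32700 works.
ovn-nbctl --log --name=vcn2_as_10_80_179_0_28_gw \
    acl-add ls_vcn2_external_gw from-lport 32600 \
    'inport == "ls_vcn2_external_gw-lr_vcn2_gw" && ip4.dst == 10.80.179.0/28' \
    drop

# The original negative-match ACL then has to be removed with
# 'ovn-nbctl acl-del', passing its direction, priority and exact match string.
```

Note that an explicit 'allow' is not strictly equivalent to the original: matching traffic no longer falls through to any lower-priority ACLs on the switch, so this only fits if no other ACL is supposed to filter it.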

The issue was mostly addressed in OVN 23.06 by the following commit:
https://github.com/ovn-org/ovn/commit/422ab29e76b5d5bf3865fc743434cb958ce20cc8
So, if many negative matches are unavoidable, you should consider
upgrading to that version or to the more recent 23.09 release.

Best regards, Ilya Maximets.
_______________________________________________
discuss mailing list
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss