On Mon, Oct 4, 2021 at 3:02 PM Vladislav Odintsov <[email protected]> wrote:
>
> Hi,
>
> I’ve run into the following issue using the latest OVN (master branch)
> with OVN-IC enabled:
> ovn-northd CPU utilisation was at ~70-80%, and the ovsdb-server serving
> the OVN_Southbound DB was also under heavy load.
>
> In the ovn-northd.log file there were warnings:
> 2021-10-01T11:14:43.548Z|18845|northd|INFO|deleting Datapath_Binding
> dd6af7f7-ea46-4496-988a-e7f9828924d0 with duplicate
> external-ids:logical-switch/router ec35e3e0-2674-47e7-b645-2c9b8b31865b
> 2021-10-01T11:14:44.148Z|18846|poll_loop|INFO|Dropped 32 log messages in
> last 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2021-10-01T11:14:44.148Z|18847|poll_loop|INFO|wakeup due to [POLLIN] on
> fd 19 (172.20.33.110:55202<->172.20.33.102:16642) at lib/stream-ssl.c:832
> (69% CPU usage)
> 2021-10-01T11:14:48.336Z|18848|ovsdb_idl|WARN|Dropped 307 log messages in
> last 60 seconds (most recently, 0 seconds ago) due to excessive rate
> 2021-10-01T11:14:48.336Z|18849|ovsdb_idl|WARN|transaction error:
> {"details":"Deletion of 1 weak reference(s) to deleted (or never-existing)
> rows from column \"datapath\" in \"IP_Multicast\" row
> 72bac803-e484-4358-9e48-11911c8aa16f caused this column to become empty,
> but constraints on this column disallow an empty
> column.","error":"constraint violation"}
>
> I looked up the datapath by the reported logical-switch id:
>
>  ~]# ovn-sbctl find datap
> external_ids:logical-switch=ec35e3e0-2674-47e7-b645-2c9b8b31865b
> _uuid               : 7e045551-7700-4a50-b0aa-02cb4e1be59d
> external_ids        : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global,
> logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b",
> name=vpc-CCF01DF6-rtb-216BABB1-global}
> load_balancers      : []
> tunnel_key          : 16712519
>
> _uuid               : dd6af7f7-ea46-4496-988a-e7f9828924d0
> external_ids        : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global,
> logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b",
> name=vpc-CCF01DF6-rtb-216BABB1-global}
> load_balancers      : []
> tunnel_key          : 5
>
>
> It refers to the ovn-ic transit switch:
>
>  ~]# ovn-ic-sbctl list datapath vpc-CCF01DF6-rtb-216BABB1-global
> _uuid               : fc159fa4-d5ba-46ed-a54c-d00745091021
> external_ids        : {}
> transit_switch      : vpc-CCF01DF6-rtb-216BABB1-global
> tunnel_key          : 16712519
>
>  ~]# ovn-ic-nbctl list transit vpc-CCF01DF6-rtb-216BABB1-global
> _uuid               : b5312889-92f9-40fe-98f9-2ea7ce3debcc
> external_ids        : {}
> name                : vpc-CCF01DF6-rtb-216BABB1-global
> other_config        : {}
>
> The problematic IP_Multicast row:
>
>  ~]# ovn-sbctl find ip-mul datapath=dd6af7f7-ea46-4496-988a-e7f9828924d0
> _uuid               : 72bac803-e484-4358-9e48-11911c8aa16f
> datapath            : dd6af7f7-ea46-4496-988a-e7f9828924d0
> enabled             : false
> eth_src             : ""
> idle_timeout        : 300
> ip4_src             : ""
> ip6_src             : ""
> querier             : true
> query_interval      : 150
> query_max_resp      : 1
> seq_no              : 0
> table_size          : 2048
>
> I tried to manually destroy this IP_Multicast row, which was blocking
> the deletion of the Datapath_Binding row:
>  ~]# ovn-sbctl destroy ip-mul 72bac803-e484-4358-9e48-11911c8aa16f
>
>  ~]# ovn-sbctl list ip-mul 72bac803-e484-4358-9e48-11911c8aa16f
> ovn-sbctl: no row "72bac803-e484-4358-9e48-11911c8aa16f" in table
> IP_Multicast
>
>
> And the problem was resolved: the ovn-northd daemon stopped consuming CPU,
> and the excess Datapath_Binding row was deleted automatically.
>
>  ~]# ovn-sbctl find datapath
> external_ids:interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global
> _uuid               : 7e045551-7700-4a50-b0aa-02cb4e1be59d
> external_ids        : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global,
> logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b",
> name=vpc-CCF01DF6-rtb-216BABB1-global}
> load_balancers      : []
> tunnel_key          : 16712519
>
> The main problem here is that northd was entirely blocked: no processing
> of the SB was done, ovn-controllers couldn’t even claim new ports, and no
> changes were propagated, so this is a critical issue.
>
> Can anybody help me understand the reason for this?
>
> Regards,
> Vladislav Odintsov

Thanks for reporting the problem, and sorry for the slow response.
This is indeed a critical issue, but I am not sure how the duplicated
record was generated. What I can think of now is that there was an SB-DB
failover, and during that period there was a race in which at least two
northd processes regarded themselves as active, so they both inserted a
Datapath_Binding record when they saw a new LS (in your case it happens to
be a TS). If this is how it happened, the race window needs to be long
enough (or the change rate in the NB must be high enough), otherwise it is
not easy to reproduce. So the question is: did you see this after an SB-DB
failover? Do you have any more timestamps to prove or disprove the above
assumption?
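
In case it is useful for checking that, a rough sketch would be to ask each
ovn-northd instance whether it currently considers itself active, and to
look for lock transitions in the logs around the failover window (the log
file path and the exact wording of the lock messages are assumptions here
and may differ by version/packaging):

 # On each node running ovn-northd; reports "active" or "standby".
 ~]# ovn-appctl -t ovn-northd status

 # Correlate timestamps of lock acquired/lost messages with the failover.
 ~]# grep -Ei 'lock (acquired|lost)' /var/log/ovn/ovn-northd.log

If two nodes report (or logged) being active at overlapping times, that
would support the race theory.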

At the same time, although duplicated datapath records are not expected to
be created in the first place, permanently blocking northd in this
situation is a more serious problem. Fortunately this is easy to fix, so I
just sent a patch with a test case covering this scenario:

https://patchwork.ozlabs.org/project/ovn/patch/[email protected]/
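
For what it's worth, the reason manually destroying the IP_Multicast row
unblocked things is presumably that the SB schema requires the "datapath"
column of IP_Multicast to be non-empty (that is what the constraint
violation in your log says), so every northd transaction that tried to
delete the duplicate Datapath_Binding kept failing and being retried until
the referencing row was gone. After the patch (or the manual workaround),
a quick sanity check that northd is committing to the SB again could be
something like this (just a sketch, assuming ovn-nbctl can reach the NB DB
at its default location):

 ~]# ovn-nbctl --wait=sb sync

If that returns promptly, northd has caught up with the NB contents.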

Thanks,
Han
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
