On Mon, Oct 4, 2021 at 3:02 PM Vladislav Odintsov <[email protected]> wrote:
>
> Hi,
>
> I’ve run into the following issue using the latest OVN (master branch) with OVN-IC enabled:
> ovn-northd CPU utilisation was at ~70-80% and the ovsdb-server serving the OVN_Southbound DB was also under heavy load.
>
> In the ovn-northd.log file there were warnings:
> 2021-10-01T11:14:43.548Z|18845|northd|INFO|deleting Datapath_Binding dd6af7f7-ea46-4496-988a-e7f9828924d0 with duplicate external-ids:logical-switch/router ec35e3e0-2674-47e7-b645-2c9b8b31865b
> 2021-10-01T11:14:44.148Z|18846|poll_loop|INFO|Dropped 32 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2021-10-01T11:14:44.148Z|18847|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (172.20.33.110:55202<->172.20.33.102:16642) at lib/stream-ssl.c:832 (69% CPU usage)
> 2021-10-01T11:14:48.336Z|18848|ovsdb_idl|WARN|Dropped 307 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate
> 2021-10-01T11:14:48.336Z|18849|ovsdb_idl|WARN|transaction error: {"details":"Deletion of 1 weak reference(s) to deleted (or never-existing) rows from column \"datapath\" in \"IP_Multicast\" row 72bac803-e484-4358-9e48-11911c8aa16f caused this column to become empty, but constraints on this column disallow an empty column.","error":"constraint violation"}
>
> I checked the datapath by the reported logical-switch id:
>
> ~]# ovn-sbctl find datap external_ids:logical-switch=ec35e3e0-2674-47e7-b645-2c9b8b31865b
> _uuid : 7e045551-7700-4a50-b0aa-02cb4e1be59d
> external_ids : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global, logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b", name=vpc-CCF01DF6-rtb-216BABB1-global}
> load_balancers : []
> tunnel_key : 16712519
>
> _uuid : dd6af7f7-ea46-4496-988a-e7f9828924d0
> external_ids : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global, logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b", name=vpc-CCF01DF6-rtb-216BABB1-global}
> load_balancers : []
> tunnel_key : 5
>
> It refers to an ovn-ic transit switch:
>
> ~]# ovn-ic-sbctl list datapath vpc-CCF01DF6-rtb-216BABB1-global
> _uuid : fc159fa4-d5ba-46ed-a54c-d00745091021
> external_ids : {}
> transit_switch : vpc-CCF01DF6-rtb-216BABB1-global
> tunnel_key : 16712519
>
> ~]# ovn-ic-nbctl list transit vpc-CCF01DF6-rtb-216BABB1-global
> _uuid : b5312889-92f9-40fe-98f9-2ea7ce3debcc
> external_ids : {}
> name : vpc-CCF01DF6-rtb-216BABB1-global
> other_config : {}
>
> The problematic IP_Multicast row:
>
> ~]# ovn-sbctl find ip-mul datapath=dd6af7f7-ea46-4496-988a-e7f9828924d0
> _uuid : 72bac803-e484-4358-9e48-11911c8aa16f
> datapath : dd6af7f7-ea46-4496-988a-e7f9828924d0
> enabled : false
> eth_src : ""
> idle_timeout : 300
> ip4_src : ""
> ip6_src : ""
> querier : true
> query_interval : 150
> query_max_resp : 1
> seq_no : 0
> table_size : 2048
>
> I manually destroyed this IP_Multicast row, which was blocking deletion of the datapath row:
> ~]# ovn-sbctl destroy ip-mul 72bac803-e484-4358-9e48-11911c8aa16f
>
> ~]# ovn-sbctl list ip-mul 72bac803-e484-4358-9e48-11911c8aa16f
> ovn-sbctl: no row "72bac803-e484-4358-9e48-11911c8aa16f" in table IP_Multicast
>
> And the problem was resolved. The ovn-northd daemon stopped consuming CPU and the excess datapath was deleted automatically.
>
> ~]# ovn-sbctl find datapath external_ids:interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global
> _uuid : 7e045551-7700-4a50-b0aa-02cb4e1be59d
> external_ids : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global, logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b", name=vpc-CCF01DF6-rtb-216BABB1-global}
> load_balancers : []
> tunnel_key : 16712519
>
> The main problem here is that northd was entirely blocked: no processing towards the SB DB was done, and even ovn-controllers couldn’t claim new ports. No changes were propagated, so this is a critical issue.
>
> Can anybody help understand the reason for this?
>
> Regards,
> Vladislav Odintsov
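The manual recovery in the report above boils down to three SB commands. A minimal sketch follows; the UUIDs and the logical-switch id are simply the ones from this report and stand in for whatever your own "duplicate external-ids" warning points at, so confirm the duplicates by hand before destroying anything:

# 1. List the Datapath_Binding rows for the logical switch named in the
#    warning; more than one row means a duplicate exists.
ovn-sbctl find Datapath_Binding \
    external_ids:logical-switch=ec35e3e0-2674-47e7-b645-2c9b8b31865b

# 2. Show the IP_Multicast row that still holds a weak reference to the
#    duplicate datapath that northd is trying to delete.
ovn-sbctl find IP_Multicast datapath=dd6af7f7-ea46-4496-988a-e7f9828924d0

# 3. Destroy that IP_Multicast row; northd can then remove the duplicate
#    Datapath_Binding on its own.
ovn-sbctl destroy IP_Multicast 72bac803-e484-4358-9e48-11911c8aa16f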
Thanks for reporting the problem, and sorry for the slow response.

This is indeed a critical issue, but I am not sure how the duplicated record was generated. What I can think of now is that there was a SB-DB failover, and during that period a race in which at least two northd processes both regarded themselves as active, so both inserted a Datapath_Binding record when they saw a new LS (in your case it happens to be a TS). If this is how it happened, the race period would need to be long enough (or the change rate in the NB high enough); otherwise it is not easy to reproduce. So the question is: did you see this after a SB-DB failover? Are there any more timestamps to prove/disprove the above assumption?

At the same time, although duplicated datapath records are not expected to be created in the first place, permanently blocking northd in this situation is a more serious problem. Fortunately this is easy to fix, so I just sent a patch with a test case covering this scenario:
https://patchwork.ozlabs.org/project/ovn/patch/[email protected]/

Thanks,
Han
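One way to test the failover assumption above: each ovn-northd instance reports whether it currently considers itself active through its "status" control command. A minimal sketch, assuming northd runs on two hosts named node1 and node2 (hypothetical names):

for host in node1 node2; do
    printf '%s: ' "$host"
    ssh "$host" ovn-appctl -t ovn-northd status
done

If both instances report active within the same window, two northds inserting the same Datapath_Binding becomes a plausible explanation; if only one was ever active, the duplicate must have another cause.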
