On Fri, Oct 29, 2021 at 9:43 AM Vladislav Odintsov <[email protected]> wrote:
>
> Hi Han,
>
> thanks for the fix.
> Unfortunately, too much time has passed since that failure; I’ve re-deployed my lab many times and all logs are unavailable.
> I see this problem reproduce when I run our internal tests, which utilise the NB quite intensively.
> I’ll try to reproduce this problem over the weekend. I should check the SB logs to see whether a failover took place, right?

Yes, and it would be good to have timestamps for the roles (leader/follower/candidate) observed for each server, from each server's perspective.
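
For example, something along these lines on each SB server would capture the role transitions over time (a rough sketch; the ctl socket path below is just the usual ovn-ctl default, so adjust it for your deployment):

  # run periodically on every SB server so the output is timestamped
  # (socket path assumes the default ovn-ctl layout)
  date -u +%FT%TZ; ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound | grep -E '^(Role|Term|Leader)'

Correlating that with the active/standby transitions each ovn-northd logs around the same time should show whether two northds could have been active simultaneously.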

> One question here: as far as I know, TS lswitches’ tunnel keys are allocated from a special space of IDs.
> logical_switch’s other_config:requested-tnl-key is used, right?
> Is it possible that the lswitch is created with some tunnel_key value (for instance, 2) and then ovn-northd tries
> to update the tunnel_key and fails for some reason? Couldn’t that be a scenario for the duplicated datapaths in the SB DB?

I suspected that, too. However, I didn't see how updating the tunnel_key in the code would result in duplicated datapaths. In this case it is likely that two northds saw different versions of the DBs (my assumption):

northd1: NB has requested-tnl-key populated, SB has the datapath with key 2
northd2: NB has the newly created TS without requested-tnl-key, SB has no datapath created for it yet

Somehow both northds held the SB lock (on different SB servers), so northd1 went ahead and updated the key in SB, while northd2 went ahead and created the datapath.
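
If you can catch it live, two quick checks would help confirm this (a rough sketch assuming the standard unixctl target name and default paths; adjust for your deployment):

  # on each node running ovn-northd: reports whether this instance considers itself active or standby
  ovn-appctl -t ovn-northd status

  # print any logical-switch UUIDs that are referenced by more than one Datapath_Binding row
  ovn-sbctl --columns=external_ids list Datapath_Binding | grep -o 'logical-switch="[^"]*"' | sort | uniq -d

Seeing output from the second command while more than one northd reports itself active would support the assumption above.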

Thanks,
Han

> Regards,
> Vladislav Odintsov
>
> On 26 Oct 2021, at 03:55, Han Zhou <[email protected]> wrote:
>
> On Mon, Oct 4, 2021 at 3:02 PM Vladislav Odintsov <[email protected]> wrote:
> >
> > Hi,
> >
> > I’ve run into the following issue using the latest OVN (master branch) with OVN-IC enabled:
> > ovn-northd CPU utilisation was at ~70-80% and the ovsdb-server serving the OVN_Southbound DB was also under heavy load.
> >
> > In the ovn-northd.log file there were warnings:
> > 2021-10-01T11:14:43.548Z|18845|northd|INFO|deleting Datapath_Binding dd6af7f7-ea46-4496-988a-e7f9828924d0 with duplicate external-ids:logical-switch/router ec35e3e0-2674-47e7-b645-2c9b8b31865b
> > 2021-10-01T11:14:44.148Z|18846|poll_loop|INFO|Dropped 32 log messages in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
> > 2021-10-01T11:14:44.148Z|18847|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (172.20.33.110:55202<->172.20.33.102:16642) at lib/stream-ssl.c:832 (69% CPU usage)
> > 2021-10-01T11:14:48.336Z|18848|ovsdb_idl|WARN|Dropped 307 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate
> > 2021-10-01T11:14:48.336Z|18849|ovsdb_idl|WARN|transaction error: {"details":"Deletion of 1 weak reference(s) to deleted (or never-existing) rows from column \"datapath\" in \"IP_Multicast\" row 72bac803-e484-4358-9e48-11911c8aa16f caused this column to become empty, but constraints on this column disallow an empty column.","error":"constraint violation"}
> >
> > I checked the datapath by the reported logical-switch id:
> >
> > ~]# ovn-sbctl find datap external_ids:logical-switch=ec35e3e0-2674-47e7-b645-2c9b8b31865b
> > _uuid               : 7e045551-7700-4a50-b0aa-02cb4e1be59d
> > external_ids        : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global, logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b", name=vpc-CCF01DF6-rtb-216BABB1-global}
> > load_balancers      : []
> > tunnel_key          : 16712519
> >
> > _uuid               : dd6af7f7-ea46-4496-988a-e7f9828924d0
> > external_ids        : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global, logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b", name=vpc-CCF01DF6-rtb-216BABB1-global}
> > load_balancers      : []
> > tunnel_key          : 5
> >
> > It refers to an ovn-ic transit switch:
> >
> > ~]# ovn-ic-sbctl list datapath vpc-CCF01DF6-rtb-216BABB1-global
> > _uuid               : fc159fa4-d5ba-46ed-a54c-d00745091021
> > external_ids        : {}
> > transit_switch      : vpc-CCF01DF6-rtb-216BABB1-global
> > tunnel_key          : 16712519
> >
> > ~]# ovn-ic-nbctl list transit vpc-CCF01DF6-rtb-216BABB1-global
> > _uuid               : b5312889-92f9-40fe-98f9-2ea7ce3debcc
> > external_ids        : {}
> > name                : vpc-CCF01DF6-rtb-216BABB1-global
> > other_config        : {}
> >
> > The problematic ip-multicast row:
> >
> > ~]# ovn-sbctl find ip-mul datapath=dd6af7f7-ea46-4496-988a-e7f9828924d0
> > _uuid               : 72bac803-e484-4358-9e48-11911c8aa16f
> > datapath            : dd6af7f7-ea46-4496-988a-e7f9828924d0
> > enabled             : false
> > eth_src             : ""
> > idle_timeout        : 300
> > ip4_src             : ""
> > ip6_src             : ""
> > querier             : true
> > query_interval      : 150
> > query_max_resp      : 1
> > seq_no              : 0
> > table_size          : 2048
> >
> > I tried to manually destroy this ip_multicast row, which was blocking the deletion of the datapath row:
> >
> > ~]# ovn-sbctl destroy ip-mul 72bac803-e484-4358-9e48-11911c8aa16f
> > ~]# ovn-sbctl list ip-mul 72bac803-e484-4358-9e48-11911c8aa16f
> > ovn-sbctl: no row "72bac803-e484-4358-9e48-11911c8aa16f" in table IP_Multicast
> >
> > And the problem was resolved. The ovn-northd daemon stopped consuming CPU and the excess datapath was deleted automatically.
> >
> > ~]# ovn-sbctl find datapath external_ids:interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global
> > _uuid               : 7e045551-7700-4a50-b0aa-02cb4e1be59d
> > external_ids        : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global, logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b", name=vpc-CCF01DF6-rtb-216BABB1-global}
> > load_balancers      : []
> > tunnel_key          : 16712519
> >
> > The main problem here is that northd was entirely blocked: no processing towards the SB was done, and even ovn-controllers couldn’t claim new ports. No changes were propagated, so this is a critical issue.
> >
> > Can anybody help understand the reason for this?
> >
> > Regards,
> > Vladislav Odintsov
>
> Thanks for reporting the problem, and sorry for the slow response.
> This is indeed a critical issue, but I am not sure how the duplicated record is generated. What I can think of now is that there was a SB-DB failover, and during that period there was a race in which at least two northd processes both regarded themselves as active, so they both inserted the Datapath_Binding record when they saw a new LS (in your case it happens to be a TS). If this is how it happened, the race period needs to be long enough (or the change rate in the NB must be high enough); otherwise it is not easy to reproduce. So the question is, did you see this after a SB-DB failover? Any more timestamps to prove/disprove the above assumption?
>
> At the same time, although duplicated datapath records are not expected to be created in the first place, permanently blocking northd in this situation is a more serious problem. Fortunately this is something easy to fix, so I just sent a patch with a test case covering this scenario:
>
> https://patchwork.ozlabs.org/project/ovn/patch/[email protected]/
>
> Thanks,
> Han

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
