On Fri, Oct 29, 2021 at 9:43 AM Vladislav Odintsov <[email protected]>
wrote:
>
> Hi Han,
>
> thanks for the fix.
> Unfortunately, too much time has passed since that failure; I’ve
> re-deployed my lab many times since then and the logs are no longer
> available.
> I see this problem reproduce when I run our internal tests, which
> utilise the NB quite intensively.
> I’ll try to reproduce this problem over the weekend. I should check the SB
> logs to see whether a failover took place, right?

Yes, and it would be good to have timestamps for the role changes
(leader/follower/candidate) that each server observed, from each server's
own perspective.
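
Something along these lines should collect that (just a sketch; the ctl
socket and log paths below assume a default ovn-ctl layout and may differ
on your hosts):

# On each SB server: the raft role and term as this member currently sees them.
ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

# Role transitions with timestamps, from each member's own log (the exact
# wording of the raft messages varies a bit between versions).
grep -E 'raft.*(leader|follower|candidate)' /var/log/ovn/ovsdb-server-sb.log

Correlating those timestamps with the ovn-northd logs around the time the
duplicate appeared would show whether a failover overlapped with the TS
creation.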

> One question here: as far as I know, TS lswitches’ tunnel keys are
> allocated from a special range of IDs.
> The logical_switch’s other_config:requested-tnl-key is used for that, right?
> Is it possible that the lswitch is created with some tunnel_key value (for
> instance, 2) and then ovn-northd tries to update the tunnel_key and fails
> for some reason? Couldn’t that be a scenario leading to the duplicated
> datapaths in the SB DB?

I suspected that, too. However, I didn't see from the code how updating
tunnel_key could result in duplicated datapaths. In this case it is likely
that the two northds saw different versions of the DBs:

(my assumption)
northd1: NB has requested-tnl-key populated, SB has the datapath with key 2
northd2: NB has the newly created TS without requested-tnl-key, SB has no
datapath created for it
Somehow both northds held the SB lock (on different servers), so northd1
went ahead and updated the key in SB, while northd2 went ahead and created
a second datapath.
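
If you manage to reproduce it, something like the following could help
confirm or rule that out (a rough sketch; socket and log paths again assume
defaults, and the exact wording of northd's lock-related log messages may
differ between versions):

# On each node running ovn-northd: which instances currently consider
# themselves active?
ovn-appctl -t ovn-northd status

# Did any instance flap between active and standby around the time the
# duplicate appeared?
grep -i 'lock' /var/log/ovn/ovn-northd.log

# Does the NB transit switch carry the key that ovn-ic requested, and which
# tunnel_key did each SB datapath end up with?
ovn-nbctl get Logical_Switch vpc-CCF01DF6-rtb-216BABB1-global \
    other_config:requested-tnl-key
ovn-sbctl find Datapath_Binding \
    external_ids:interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global

If two northds reported active around the same time, or one of the duplicate
SB rows carries the ovn-ic-requested key while the other carries a freshly
allocated one, that would support the scenario above.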

Thanks,
Han

>
> Regards,
> Vladislav Odintsov
>
> On 26 Oct 2021, at 03:55, Han Zhou <[email protected]> wrote:
>
> On Mon, Oct 4, 2021 at 3:02 PM Vladislav Odintsov <[email protected]> wrote:
>
>
> Hi,
>
> I’ve faced the following issue using the latest OVN (master branch) with
> OVN-IC enabled:
>
> ovn-northd CPU utilisation was at ~70-80%, and the ovsdb-server serving the
> OVN_Southbound DB was also under heavy load.
>
>
> In ovn-northd.log file there were warnings:
>
> 2021-10-01T11:14:43.548Z|18845|northd|INFO|deleting Datapath_Binding
> dd6af7f7-ea46-4496-988a-e7f9828924d0 with duplicate
> external-ids:logical-switch/router ec35e3e0-2674-47e7-b645-2c9b8b31865b
> 2021-10-01T11:14:44.148Z|18846|poll_loop|INFO|Dropped 32 log messages in
> last 6 seconds (most recently, 0 seconds ago) due to excessive rate
> 2021-10-01T11:14:44.148Z|18847|poll_loop|INFO|wakeup due to [POLLIN] on
> fd 19 (172.20.33.110:55202<->172.20.33.102:16642) at lib/stream-ssl.c:832
> (69% CPU usage)
> 2021-10-01T11:14:48.336Z|18848|ovsdb_idl|WARN|Dropped 307 log messages in
> last 60 seconds (most recently, 0 seconds ago) due to excessive rate
> 2021-10-01T11:14:48.336Z|18849|ovsdb_idl|WARN|transaction error:
> {"details":"Deletion of 1 weak reference(s) to deleted (or never-existing)
> rows from column \"datapath\" in \"IP_Multicast\" row
> 72bac803-e484-4358-9e48-11911c8aa16f caused this column to become empty,
> but constraints on this column disallow an empty
> column.","error":"constraint violation"}
>
>
> I checked the datapath by the reported logical-switch id:
>
> ~]# ovn-sbctl find datapath external_ids:logical-switch=ec35e3e0-2674-47e7-b645-2c9b8b31865b
> _uuid               : 7e045551-7700-4a50-b0aa-02cb4e1be59d
> external_ids        : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global, logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b", name=vpc-CCF01DF6-rtb-216BABB1-global}
> load_balancers      : []
> tunnel_key          : 16712519
>
> _uuid               : dd6af7f7-ea46-4496-988a-e7f9828924d0
> external_ids        : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global, logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b", name=vpc-CCF01DF6-rtb-216BABB1-global}
> load_balancers      : []
> tunnel_key          : 5
>
>
> It refers to an ovn-ic transit switch:
>
> ~]# ovn-ic-sbctl list datapath vpc-CCF01DF6-rtb-216BABB1-global
> _uuid               : fc159fa4-d5ba-46ed-a54c-d00745091021
> external_ids        : {}
> transit_switch      : vpc-CCF01DF6-rtb-216BABB1-global
> tunnel_key          : 16712519
>
> ~]# ovn-ic-nbctl list transit vpc-CCF01DF6-rtb-216BABB1-global
> _uuid               : b5312889-92f9-40fe-98f9-2ea7ce3debcc
> external_ids        : {}
> name                : vpc-CCF01DF6-rtb-216BABB1-global
> other_config        : {}
>
> The problematic IP_Multicast record:
>
> ~]# ovn-sbctl find ip-mul datapath=dd6af7f7-ea46-4496-988a-e7f9828924d0
> _uuid               : 72bac803-e484-4358-9e48-11911c8aa16f
> datapath            : dd6af7f7-ea46-4496-988a-e7f9828924d0
> enabled             : false
> eth_src             : ""
> idle_timeout        : 300
> ip4_src             : ""
> ip6_src             : ""
> querier             : true
> query_interval      : 150
> query_max_resp      : 1
> seq_no              : 0
> table_size          : 2048
>
> I manually destroyed this IP_Multicast record, which was blocking the
> deletion of the Datapath_Binding record:
>
> ~]# ovn-sbctl destroy ip-mul 72bac803-e484-4358-9e48-11911c8aa16f
>
> ~]# ovn-sbctl list ip-mul 72bac803-e484-4358-9e48-11911c8aa16f
> ovn-sbctl: no row "72bac803-e484-4358-9e48-11911c8aa16f" in table IP_Multicast
>
>
>
> And the problem was resolved: the ovn-northd daemon stopped consuming CPU
> and the excess datapath was deleted automatically.
>
> ~]# ovn-sbctl find datapath external_ids:interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global
> _uuid               : 7e045551-7700-4a50-b0aa-02cb4e1be59d
> external_ids        : {interconn-ts=vpc-CCF01DF6-rtb-216BABB1-global, logical-switch="ec35e3e0-2674-47e7-b645-2c9b8b31865b", name=vpc-CCF01DF6-rtb-216BABB1-global}
> load_balancers      : []
> tunnel_key          : 16712519
>
> The main problem here is that northd was entirely blocked: no processing
> towards the SB was done, and even ovn-controllers couldn’t claim new ports.
> No changes were propagated, so this is a critical issue.
>
>
> Can anybody help me understand the reason for this?
>
> Regards,
> Vladislav Odintsov
>
>
> Thanks for reporting the problem, and sorry for the slow response.
> This is indeed a critical issue, but I am not sure how the duplicated
> record is generated. What I can think of now is that there was an SB-DB
> failover, and during that period there was a race in which at least two
> northd processes regarded themselves as active, so they both inserted a
> Datapath_Binding record when they saw the new LS (in your case it happens
> to be a TS). If this is how it happened, the race period needs to be long
> enough (or the change rate in NB must be high enough); otherwise it is not
> easy to reproduce. So the question is: did you see this after an SB-DB
> failover? Do you have any more timestamps to prove/disprove the above
> assumption?
>
> At the same time, although duplicated datapath records are not expected to
> be created in the first place, permanently blocking northd in this
> situation is a more serious problem. Fortunately, that part is easy to fix,
> so I just sent a patch with a test case covering this scenario:
>
> https://patchwork.ozlabs.org/project/ovn/patch/[email protected]/
>
> Thanks,
> Han
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
