I changed the title for this specific problem. I looked into the logs and have more findings. The problem was happening when the sb-db leader switched.
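For reference, which sb-db member is currently the raft leader, the current term, and the configured election timer can be checked on each member with ovsdb-server's cluster/status command. A sketch (the ctl socket path is an assumption; it varies by packaging):

    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound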
For an OVSDB cluster, what may trigger a leader switch? Given this log entry:

2020-07-21T01:08:38.119Z|00074|raft|INFO|term 2: 1135 ms timeout expired, starting election

the election was started by a follower node. Is that because the connection from the follower to the leader timed out, so the follower assumed the leader was dead and started an election?

As for ovn-northd (3 instances): they all connect to the sb-db leader, and whoever holds the lock is the master. When the sb-db leader switches, all ovn-northd instances look for the new leader. In that case there is no guarantee that the old ovn-northd master keeps its role; another ovn-northd instance may find the new leader and acquire the lock first. So an sb-db leader switch may also cause an ovn-northd master switch. Such a switch may happen in the middle of an ovn-northd transaction; in that case, is there any guarantee of transaction completeness? My guess is that the old master created a datapath-binding for a logical switch, the switch-over happened before that transaction completed, and then the new master created another datapath-binding for the same logical switch. Does that make any sense?

From the logs, when the sb-db leader switched, the ovn-northd master connected to the new leader and lost the lock, but it immediately re-acquired the lock and became master again. Not sure how this happened.
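A note on the election timeout question above: a raft follower starts an election whenever it has not heard from the leader within the election timer, and the old leader's log below shows an "Unreasonably long 1435ms poll interval" at exactly that moment, i.e. the leader was apparently blocked for longer than the follower's 1135 ms timeout. If that is what is happening, one mitigation is to raise the sb-db election timer on the current leader. A sketch (the ctl socket path is an assumption; it varies by packaging):

    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/change-election-timer OVN_Southbound 4000

As far as I know the command only accepts up to roughly double the current value per call, so reaching a larger value takes a few invocations.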
Here are some log excerpts.

==== .84 sb-db leader =================
2020-07-21T01:08:20.221Z|01408|raft|INFO|current entry eid 639238ba-bc00-4efe-bb66-6ac766bb5f4b does not match prerequisite 78e8e167-8b4c-4292-8e25-d9975631b010 in execute_command_request
2020-07-21T01:08:38.450Z|01409|timeval|WARN|Unreasonably long 1435ms poll interval (1135ms user, 43ms system)
2020-07-21T01:08:38.451Z|01410|timeval|WARN|faults: 5942 minor, 0 major
2020-07-21T01:08:38.451Z|01411|timeval|WARN|disk: 0 reads, 50216 writes
2020-07-21T01:08:38.452Z|01412|timeval|WARN|context switches: 60 voluntary, 25 involuntary
2020-07-21T01:08:38.453Z|01413|coverage|INFO|Skipping details of duplicate event coverage for hash=45329980
2020-07-21T01:08:38.455Z|01414|raft|WARN|ignoring vote request received as leader
2020-07-21T01:08:38.456Z|01415|raft|INFO|server 1f9e is leader for term 2
2020-07-21T01:08:38.457Z|01416|raft|INFO|rejected append_reply (not leader)
2020-07-21T01:08:38.471Z|01417|raft|INFO|rejected append_reply (not leader)
2020-07-21T01:23:00.890Z|01418|timeval|WARN|Unreasonably long 1336ms poll interval (1102ms user, 20ms system)

==== .85 sb-db ==================
2020-07-21T01:05:25.734Z|00061|coverage|INFO|Skipping details of duplicate event coverage for hash=a1e16f5b
2020-07-21T01:08:38.166Z|00062|raft|INFO|server 1f9e is leader for term 2
2020-07-21T01:08:38.455Z|00063|raft|INFO|rejecting term 1 < current term 2 received in append_request message from server 39a9
2020-07-21T01:23:44.500Z|00064|timeval|WARN|Unreasonably long 1447ms poll interval (1118ms user, 33ms system)

==== .86 sb-db =================
2020-07-21T01:02:35.607Z|00073|coverage|INFO|70 events never hit
2020-07-21T01:08:38.119Z|00074|raft|INFO|term 2: 1135 ms timeout expired, starting election
2020-07-21T01:08:38.165Z|00075|raft|INFO|term 2: elected leader by 2+ of 3 servers
2020-07-21T01:08:38.455Z|00076|raft|INFO|rejecting term 1 < current term 2 received in append_request message from server 39a9
2020-07-21T01:08:38.456Z|00077|raft|INFO|rejecting term 1 < current term 2 received in vote_reply message from server 39a9
2020-07-21T01:12:51.363Z|00078|timeval|WARN|Unreasonably long 1466ms poll interval (1186ms user, 27ms system)

==== .85 ovn-northd master ==============
2020-07-21T01:08:25.925Z|00479|poll_loop|INFO|wakeup due to [POLLIN] on fd 8 (10.6.20.85:45478<->10.6.20.84:6641) at lib/stream-fd.c:157 (54% CPU usage)
2020-07-21T01:08:38.464Z|00480|ovsdb_idl|INFO|tcp:10.6.20.84:6642: clustered database server is not cluster leader; trying another server
2020-07-21T01:08:38.465Z|00481|reconnect|INFO|tcp:10.6.20.84:6642: connection attempt timed out
2020-07-21T01:08:38.465Z|00482|reconnect|INFO|tcp:10.6.20.86:6642: connecting...
2020-07-21T01:08:38.465Z|00483|reconnect|INFO|tcp:10.6.20.86:6642: connected
2020-07-21T01:08:38.465Z|00484|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2020-07-21T01:08:38.465Z|00485|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2020-07-21T01:08:38.966Z|00486|ovn_northd|INFO|deleting Datapath_Binding 108cf745-db82-43c0-a9d3-afe27a41e4aa with duplicate external-ids:logical-switch/router 8a5d1d3c-e9fc-4cbe-a461-98ff838e6473
2020-07-20T21:32:48.803Z|00019|reconnect|INFO|tcp:10.6.20.84:6641: connected

==== .86 ovn-northd ==============
2020-07-21T01:08:38.463Z|00020|ovsdb_idl|INFO|tcp:10.6.20.84:6642: clustered database server is not cluster leader; trying another server
2020-07-21T01:08:38.464Z|00021|reconnect|INFO|tcp:10.6.20.84:6642: connection attempt timed out
2020-07-21T01:08:38.464Z|00022|reconnect|INFO|tcp:10.6.20.85:6642: connecting...
2020-07-21T01:08:38.464Z|00023|reconnect|INFO|tcp:10.6.20.85:6642: connected
2020-07-21T01:08:38.464Z|00024|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2020-07-21T01:08:38.465Z|00025|ovsdb_idl|INFO|tcp:10.6.20.85:6642: clustered database server is not cluster leader; trying another server
2020-07-21T01:08:38.466Z|00026|reconnect|INFO|tcp:10.6.20.85:6642: connection attempt timed out
2020-07-21T01:08:39.467Z|00027|reconnect|INFO|tcp:10.6.20.86:6642: connecting...
2020-07-21T01:08:39.468Z|00028|reconnect|INFO|tcp:10.6.20.86:6642: connected
2020-07-21T01:08:39.468Z|00029|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2020-07-21T02:04:22.718Z|00030|ovsdb_idl|INFO|tcp:10.6.20.86:6642: clustered database server is not cluster leader; trying another server

Thanks!
Tony

________________________________
From: Numan Siddique <[email protected]>
Sent: July 23, 2020 03:15 AM
To: Tony Liu <[email protected]>
Cc: ovs-dev <[email protected]>; [email protected]
Subject: Re: [ovs-discuss] OVN nb-db and sb-db out of sync

On Thu, Jul 23, 2020 at 8:22 AM Tony Liu <[email protected]> wrote:

Hi,

I see why sb-db broke at the 1568th port-binding. The 1568th datapath-binding in sb-db references the same logical-switch as the 1567th:

_uuid               : 108cf745-db82-43c0-a9d3-afe27a41e4aa
external_ids        : {logical-switch="8a5d1d3c-e9fc-4cbe-a461-98ff838e6473", name=neutron-e907dc17-f1e8-4217-a37d-86e9a98c86c2, name2=net-97-192}
tunnel_key          : 1567

_uuid               : d934ed92-2f3c-4b31-8a76-2a5047a3bb46
external_ids        : {logical-switch="8a5d1d3c-e9fc-4cbe-a461-98ff838e6473", name=neutron-e907dc17-f1e8-4217-a37d-86e9a98c86c2, name2=net-97-192}
tunnel_key          : 1568

I don't believe this is supposed to happen. Any idea how it could happen?

ovn-northd is then stuck trying to delete this duplicate, and it ignores all following updates. That is what caused nb-db and sb-db to go out of sync. Is there any way I can fix it manually, e.g. by deleting it with ovn-sbctl?

If you delete the ovn sb db resources manually, ovn-northd should sync it up. But I'm surprised that ovn-northd didn't sync up earlier. There's something wrong related to raft going on here; not sure what.

Thanks
Numan
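For the manual cleanup, a sketch of what it could look like with ovn-sbctl's generic database commands, using the duplicate row from the excerpt above (please double-check which of the two rows ovn-northd is actually trying to remove, per its "deleting Datapath_Binding ..." log line, before destroying anything):

    # show all datapath bindings that reference this logical switch
    ovn-sbctl find Datapath_Binding 'external_ids:logical-switch="8a5d1d3c-e9fc-4cbe-a461-98ff838e6473"'

    # remove the duplicate row
    ovn-sbctl destroy Datapath_Binding 108cf745-db82-43c0-a9d3-afe27a41e4aa

After the duplicate is gone, the active ovn-northd should be able to make progress again.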
Thanks!
Tony

________________________________
From: dev <[email protected]> on behalf of Tony Liu <[email protected]>
Sent: July 22, 2020 11:33 AM
To: ovs-dev <[email protected]>
Subject: [ovs-dev] OVN nb-db and sb-db out of sync

Hi,

During a scaling test where 4000 networks are created from OpenStack, I see that nb-db and sb-db are out of sync. All 4000 logical switches and 8000 LS ports (a GW port and a service port for each network) are created in nb-db. In sb-db, there are only 1567 port-bindings, while 4000 are expected.

[root@ovn-db-2 ~]# ovn-nbctl list nb_global
_uuid               : b7b3aa05-f7ed-4dbc-979f-10445ac325b8
connections         : []
external_ids        : {"neutron:liveness_check_at"="2020-07-22 04:03:17.726917+00:00"}
hv_cfg              : 312
ipsec               : false
name                : ""
nb_cfg              : 2636
options             : {mac_prefix="ca:e8:07", svc_monitor_mac="4e:d0:3a:80:d4:b7"}
sb_cfg              : 2005
ssl                 : []

[root@ovn-db-2 ~]# ovn-sbctl list sb_global
_uuid               : 3720bc1d-b0da-47ce-85ca-96fa8d398489
connections         : []
external_ids        : {}
ipsec               : false
nb_cfg              : 312
options             : {mac_prefix="ca:e8:07", svc_monitor_mac="4e:d0:3a:80:d4:b7"}
ssl                 : []

Is there any way to force ovn-northd to rebuild sb-db so that it syncs with nb-db, for example by manipulating nb_cfg or anything else? Note that it is a 3-node RAFT cluster for both nb-db and sb-db.

Is the "incremental update" work implemented in 20.03? If not, in which release is it going to be available?

Thanks!
Tony
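One way to watch whether ovn-northd is catching up after a cleanup like the one above is to compare the sequence numbers from the two records shown here. A sketch, relying on the "." single-row shorthand that both ctl tools support:

    ovn-nbctl get NB_Global . nb_cfg
    ovn-sbctl get SB_Global . nb_cfg

Once ovn-northd has fully processed the northbound contents, nb_cfg in SB_Global should catch up to nb_cfg in NB_Global; if your ovn-nbctl has the "sync" command, "ovn-nbctl --wait=sb sync" should block until that happens.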
