Hi,
I’ve got a 3-node RAFT cluster, serving ovn northbound DB and trying to
understand what triggered ovsdb-server leadership change.
Can somebody help explain that?
This cluster runs ovs 2.13.4 and has ~25 active clients: 3 x ovn-northd, 3 x
ovn-ic, 11 python-ovsdbapp clients (CMS).
Server logs (per-node) listed above.
Node 1:
2021-08-12T07:45:55.189Z|03131|raft|INFO|server be61 is leader for term 357
2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout expired,
starting election
2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+ of 3
servers
2021-08-12T07:46:05.558Z|03134|raft|INFO|rejecting term 357 < current term 358
received in vote_reply message from server be61
2021-08-12T07:46:29.181Z|03135|timeval|WARN|Unreasonably long 2122ms poll
interval (2120ms user, 0ms system)
2021-08-12T07:46:29.181Z|03136|timeval|WARN|context switches: 0 voluntary, 7
involuntary
2021-08-12T07:46:29.181Z|03137|coverage|INFO|Event coverage, avg rate over
last: 5 seconds, last minute, last hour, hash=3b99bba2:
2021-08-12T07:46:29.181Z|03138|coverage|INFO|hmap_pathological 94.0/sec
9.917/sec 0.1819/sec total: 85714
2021-08-12T07:46:29.181Z|03139|coverage|INFO|hmap_expand
39281.4/sec 4318.283/sec 112.9786/sec total: 193444648
2021-08-12T07:46:29.181Z|03140|coverage|INFO|lockfile_lock 0.0/sec
0.000/sec 0.0000/sec total: 1
2021-08-12T07:46:29.181Z|03141|coverage|INFO|poll_create_node 201.6/sec
109.617/sec 49.4042/sec total: 425254534
2021-08-12T07:46:29.181Z|03142|coverage|INFO|poll_zero_timeout 2.0/sec
1.567/sec 0.9350/sec total: 2938208
2021-08-12T07:46:29.181Z|03143|coverage|INFO|seq_change 13.0/sec
10.433/sec 5.5928/sec total: 22320268
2021-08-12T07:46:29.181Z|03144|coverage|INFO|pstream_open 0.0/sec
0.000/sec 0.0000/sec total: 4
2021-08-12T07:46:29.181Z|03145|coverage|INFO|stream_open 0.0/sec
0.000/sec 0.0000/sec total: 20
2021-08-12T07:46:29.181Z|03146|coverage|INFO|unixctl_received 0.0/sec
0.017/sec 0.0164/sec total: 54512
2021-08-12T07:46:29.181Z|03147|coverage|INFO|unixctl_replied 0.0/sec
0.017/sec 0.0164/sec total: 54512
2021-08-12T07:46:29.181Z|03148|coverage|INFO|util_xalloc
1693863.2/sec 181037.517/sec 3963.9467/sec total: 4830747405
2021-08-12T07:46:29.181Z|03149|coverage|INFO|97 events never hit
2021-08-12T07:46:29.186Z|03150|raft|WARN|ignoring vote request received as
leader
2021-08-12T07:46:29.186Z|03151|raft|INFO|server be61 is leader for term 359
2021-08-12T07:46:29.187Z|03152|raft|INFO|1076 truncating 1 entries from end of
log
2021-08-12T07:46:29.187Z|03153|raft|INFO|rejected append_reply (not leader)
2021-08-12T07:46:29.187Z|03154|raft|INFO|rejected append_reply (not leader)
2021-08-12T07:46:29.187Z|03155|raft|INFO|rejected append_reply (not leader)
2021-08-12T07:46:29.191Z|03156|raft|INFO|rejected append_reply (not leader)
2021-08-12T07:46:29.221Z|03157|jsonrpc|WARN|Dropped 4 log messages in last
14866 seconds (most recently, 14865 seconds ago) due to excessive rate
2021-08-12T07:46:29.221Z|03158|jsonrpc|WARN|tcp:client-1:42402: receive error:
Connection reset by peer
2021-08-12T07:46:29.222Z|03159|reconnect|WARN|tcp:client-1:42402: connection
dropped (Connection reset by peer)
2021-08-12T07:46:29.222Z|03160|jsonrpc|WARN|tcp:client-2:45746: receive error:
Connection reset by peer
2021-08-12T07:46:29.222Z|03161|reconnect|WARN|tcp:client-2:45746: connection
dropped (Connection reset by peer)
2021-08-12T07:46:29.225Z|03162|jsonrpc|WARN|tcp:client-3:54218: receive error:
Connection reset by peer
2021-08-12T07:46:29.225Z|03163|reconnect|WARN|tcp:client-3:54218: connection
dropped (Connection reset by peer)
2021-08-12T07:46:29.232Z|03164|jsonrpc|WARN|tcp:client-4:48064: receive error:
Connection reset by peer
2021-08-12T07:46:29.232Z|03165|reconnect|WARN|tcp:client-4:48064: connection
dropped (Connection reset by peer)
2021-08-12T07:46:29.232Z|03166|jsonrpc|WARN|tcp:client-5:49022: receive error:
Connection reset by peer
2021-08-12T07:46:29.232Z|03167|reconnect|WARN|tcp:client-5:49022: connection
dropped (Connection reset by peer)
2021-08-12T07:46:29.566Z|03168|reconnect|WARN|tcp:node-3:40902: connection
dropped (Connection reset by peer)
2021-08-12T07:46:30.047Z|03169|poll_loop|INFO|Dropped 64 log messages in last
14879 seconds (most recently, 14876 seconds ago) due to excessive rate
2021-08-12T07:46:30.047Z|03170|poll_loop|INFO|wakeup due to [POLLOUT] on fd 49
(node-1:6641<->ip7:48658) at lib/stream-fd.c:153 (72% CPU usage)
2021-08-12T07:46:30.195Z|03171|poll_loop|INFO|wakeup due to [POLLIN] on fd 20
(node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
2021-08-12T07:46:30.351Z|03172|poll_loop|INFO|wakeup due to [POLLOUT] on fd 43
(node-1:6641<->ip8:48656) at lib/stream-fd.c:153 (72% CPU usage)
2021-08-12T07:46:30.529Z|03173|poll_loop|INFO|wakeup due to [POLLIN] on fd 20
(node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
2021-08-12T07:46:30.753Z|03174|poll_loop|INFO|wakeup due to [POLLOUT] on fd 39
(node-1:6641<->ip9:48062) at lib/stream-fd.c:153 (72% CPU usage)
2021-08-12T07:46:30.864Z|03175|poll_loop|INFO|wakeup due to [POLLIN] on fd 20
(node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
2021-08-12T07:46:30.980Z|03176|poll_loop|INFO|wakeup due to [POLLOUT] on fd 49
(node-1:6641<->ip10:48658) at lib/stream-fd.c:153 (72% CPU usage)
2021-08-12T07:46:31.156Z|03177|poll_loop|INFO|wakeup due to [POLLOUT] on fd 45
(node-1:6641<->ip11:54216) at lib/stream-fd.c:153 (72% CPU usage)
2021-08-12T07:46:31.197Z|03178|poll_loop|INFO|wakeup due to [POLLIN] on fd 20
(node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
2021-08-12T07:46:31.198Z|03179|poll_loop|INFO|wakeup due to 0-ms timeout at
lib/reconnect.c:664 (72% CPU usage)
2021-08-12T07:46:46.654Z|03180|raft|INFO|server 9f6e is leader for term 360
2021-08-12T07:46:46.954Z|03181|raft|INFO|rejecting term 359 < current term 360
received in append_request message from server be61
Node 2:
2021-08-12T07:45:54.748Z|02385|raft|INFO|Transferring leadership to write a
snapshot.
2021-08-12T07:45:55.058Z|02386|raft|INFO|server be61 is leader for term 357
2021-08-12T07:45:55.185Z|02387|raft|INFO|rejecting term 356 < current term 357
received in append_reply message from server 1076
2021-08-12T07:46:04.192Z|02388|raft|INFO|server 1076 is leader for term 358
2021-08-12T07:46:28.339Z|02389|raft|INFO|server be61 is leader for term 359
2021-08-12T07:46:29.177Z|02390|raft|INFO|rejecting term 358 < current term 359
received in append_request message from server 1076
2021-08-12T07:46:29.183Z|02391|raft|INFO|rejecting term 358 < current term 359
received in append_request message from server 1076
2021-08-12T07:46:46.650Z|02392|raft|INFO|term 360: 1425 ms timeout expired,
starting election
2021-08-12T07:46:46.654Z|02393|raft|INFO|term 360: elected leader by 2+ of 3
servers
2021-08-12T07:46:46.954Z|02394|raft|INFO|rejecting term 359 < current term 360
received in append_request message from server be61
Node 3:
2021-08-12T07:45:54.750Z|03526|raft|INFO|received leadership transfer from 9f6e
in term 356
2021-08-12T07:45:54.750Z|03527|raft|INFO|term 357: starting election
2021-08-12T07:45:55.057Z|03528|raft|INFO|term 357: elected leader by 2+ of 3
servers
2021-08-12T07:46:05.558Z|03529|raft|WARN|ignoring vote request received as
leader
2021-08-12T07:46:05.558Z|03530|raft|INFO|server 1076 is leader for term 358
2021-08-12T07:46:05.558Z|03531|timeval|WARN|Unreasonably long 2735ms poll
interval (2733ms user, 0ms system)
2021-08-12T07:46:05.558Z|03532|timeval|WARN|faults: 5583 minor, 0 major
2021-08-12T07:46:05.558Z|03533|timeval|WARN|disk: 0 reads, 8 writes
2021-08-12T07:46:05.558Z|03534|timeval|WARN|context switches: 0 voluntary, 19
involuntary
2021-08-12T07:46:05.558Z|03535|coverage|INFO|Skipping details of duplicate
event coverage for hash=3b99bba2
2021-08-12T07:46:05.559Z|03536|poll_loop|INFO|Dropped 59 log messages in last
8160 seconds (most recently, 8157 seconds ago) due to excessive rate
2021-08-12T07:46:05.559Z|03537|poll_loop|INFO|wakeup due to [POLLOUT] on fd 37
(node-3:6641<->client-1:49952) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.559Z|03538|poll_loop|INFO|wakeup due to [POLLOUT] on fd 46
(node-3:6641<->client-2:39512) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.559Z|03539|poll_loop|INFO|wakeup due to [POLLOUT] on fd 35
(node-3:6641<->client-3:35686) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.559Z|03540|poll_loop|INFO|wakeup due to [POLLOUT] on fd 47
(node-3:6641<->client-4:49954) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.559Z|03541|poll_loop|INFO|wakeup due to [POLLOUT] on fd 39
(node-3:6641<->client-5:33490) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.559Z|03542|poll_loop|INFO|wakeup due to [POLLOUT] on fd 45
(node-3:6641<->client-6:41156) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.559Z|03543|poll_loop|INFO|wakeup due to [POLLOUT] on fd 40
(node-3:6641<->client-7:40098) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.559Z|03544|poll_loop|INFO|wakeup due to [POLLOUT] on fd 43
(node-3:6641<->client-8:38632) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.559Z|03545|poll_loop|INFO|wakeup due to [POLLOUT] on fd 36
(node-3:6641<->client-9:34912) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.559Z|03546|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41
(node-3:6641<->client-10:55638) at lib/stream-fd.c:153 (57% CPU usage)
2021-08-12T07:46:05.896Z|03547|jsonrpc|WARN|tcp:client-11:43772: send error:
Broken pipe
2021-08-12T07:46:05.896Z|03548|reconnect|WARN|tcp:client-11:43772: connection
dropped (Broken pipe)
2021-08-12T07:46:05.928Z|03549|jsonrpc|WARN|tcp:client-12:55640: receive error:
Connection reset by peer
2021-08-12T07:46:05.929Z|03550|reconnect|WARN|tcp:client-12:55640: connection
dropped (Connection reset by peer)
2021-08-12T07:46:19.068Z|03551|jsonrpc|WARN|tcp:client-13:33490: receive error:
Connection reset by peer
2021-08-12T07:46:19.069Z|03552|reconnect|WARN|tcp:client-13:33490: connection
dropped (Connection reset by peer)
2021-08-12T07:46:19.176Z|03553|jsonrpc|WARN|tcp:client-14:38632: receive error:
Connection reset by peer
2021-08-12T07:46:19.177Z|03554|reconnect|WARN|tcp:client-14:38632: connection
dropped (Connection reset by peer)
2021-08-12T07:46:20.712Z|03555|jsonrpc|WARN|tcp:client-15:34912: receive error:
Connection reset by peer
2021-08-12T07:46:20.713Z|03556|reconnect|WARN|tcp:client-15:34912: connection
dropped (Connection reset by peer)
2021-08-12T07:46:23.453Z|03557|jsonrpc|WARN|tcp:client-16:41158: receive error:
Connection reset by peer
2021-08-12T07:46:23.453Z|03558|reconnect|WARN|tcp:client-16:41158: connection
dropped (Connection reset by peer)
2021-08-12T07:46:28.336Z|03559|raft|INFO|term 359: 1318 ms timeout expired,
starting election
2021-08-12T07:46:28.339Z|03560|raft|INFO|term 359: elected leader by 2+ of 3
servers
2021-08-12T07:46:29.178Z|03561|raft|INFO|rejecting term 358 < current term 359
received in append_request message from server 1076
2021-08-12T07:46:29.182Z|03562|raft|INFO|rejecting term 358 < current term 359
received in append_request message from server 1076
2021-08-12T07:46:29.187Z|03563|raft|INFO|rejecting term 358 < current term 359
received in vote_reply message from server 1076
2021-08-12T07:46:46.957Z|03564|timeval|WARN|Unreasonably long 1645ms poll
interval (1642ms user, 0ms system)
2021-08-12T07:46:46.957Z|03565|timeval|WARN|faults: 163 minor, 0 major
2021-08-12T07:46:46.957Z|03566|timeval|WARN|context switches: 0 voluntary, 22
involuntary
2021-08-12T07:46:46.957Z|03567|coverage|INFO|Skipping details of duplicate
event coverage for hash=3b99bba2
2021-08-12T07:46:47.233Z|03568|raft|INFO|rejected append_reply (not leader)
2021-08-12T07:46:47.233Z|03569|raft|WARN|ignoring vote request received after
only 3 ms (minimum election time is 1000 ms)
2021-08-12T07:46:47.233Z|03570|raft|INFO|server 9f6e is leader for term 360
2021-08-12T07:46:47.234Z|03571|raft|INFO|rejected append_reply (not leader)
2021-08-12T07:46:48.290Z|03572|poll_loop|INFO|Dropped 49 log messages in last
43 seconds (most recently, 40 seconds ago) due to excessive rate
2021-08-12T07:46:48.290Z|03573|poll_loop|INFO|wakeup due to [POLLOUT] on fd 43
(node-3:6641<->client-17:34970) at lib/stream-fd.c:153 (64% CPU usage)
2021-08-12T07:46:48.446Z|03574|poll_loop|INFO|wakeup due to [POLLIN] on fd 30
(node-3:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage)
2021-08-12T07:46:48.647Z|03575|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41
(node-3:6641<->client-18:38648) at lib/stream-fd.c:153 (64% CPU usage)
2021-08-12T07:46:48.780Z|03576|poll_loop|INFO|wakeup due to [POLLIN] on fd 30
(node-3:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage)
2021-08-12T07:46:48.919Z|03577|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41
(node-3:6641<->client-19:38648) at lib/stream-fd.c:153 (64% CPU usage)
2021-08-12T07:46:49.113Z|03578|poll_loop|INFO|wakeup due to [POLLIN] on fd 30
(1node-32.20.1.125:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage)
2021-08-12T07:46:49.296Z|03579|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41
(node-3:6641<->client-20:38648) at lib/stream-fd.c:153 (64% CPU usage)
2021-08-12T07:46:50.837Z|03580|jsonrpc|WARN|tcp:client-21:35700: receive error:
Connection reset by peer
2021-08-12T07:46:50.837Z|03581|reconnect|WARN|tcp:client-21:35700: connection
dropped (Connection reset by peer)
2021-08-12T07:46:55.042Z|03582|jsonrpc|WARN|tcp:client-22:39840: receive error:
Connection reset by peer
2021-08-12T07:46:55.043Z|03583|reconnect|WARN|tcp:client-22:39840: connection
dropped (Connection reset by peer)
2021-08-12T07:47:01.015Z|03584|jsonrpc|WARN|tcp:client-23:55826: receive error:
Connection reset by peer
2021-08-12T07:47:01.015Z|03585|reconnect|WARN|tcp:client-23:55826: connection
dropped (Connection reset by peer)
First leader ownership transfer (at 07:45:54) is understandable.
node-3 received leadership transfer from node-1 because node-1 was prepearing
to take a DB snapshot/compaction.
But then at 07:46:04.187Z node-1 wrote:
2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout expired,
starting election
2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+ of 3
servers
I’m not sure, but I suppose node-1 didn’t get heartbet from leader (node-3),
started leader election and won it. Right?
My question is: how to understand the reason why node-3 couldn’t send raft
heartbeat in time?
Can this long poll be the indicator that server was not answering heartbeats?
2021-08-12T07:46:05.558Z|03531|timeval|WARN|Unreasonably long 2735ms poll
interval (2733ms user, 0ms system)
2021-08-12T07:46:05.558Z|03532|timeval|WARN|faults: 5583 minor, 0 major
2021-08-12T07:46:05.558Z|03533|timeval|WARN|disk: 0 reads, 8 writes
2021-08-12T07:46:05.558Z|03534|timeval|WARN|context switches: 0 voluntary, 19
involuntary
Thanks in advance for your help.
Regards,
Vladislav Odintsov
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss