Han,

Thanks for your reply, and thanks for confirming my reading of the code at the time as well: “from what I can see, raft->leader_sid is also updated in the only two places where raft->candidate_retrying is set (raft_start_election() and raft_set_leader()), which means it should not be possible for raft->candidate_retrying to be TRUE while raft->leader_sid is non-zero.”
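For reference, the coupling described above is roughly the following. This is only a paraphrase of how I read ovsdb/raft.c at the time, not the verbatim code; struct raft is cut down to just the fields involved and everything else in the two functions is elided:

/* Paraphrased sketch, not verbatim ovsdb/raft.c: struct raft reduced to the
 * fields that matter for the invariant discussed above. */
#include <stdbool.h>

struct uuid { unsigned int parts[4]; };
static const struct uuid UUID_ZERO;    /* all-zero UUID */

struct raft {
    struct uuid leader_sid;    /* Server ID of the leader, all-zero if unknown. */
    bool candidate_retrying;   /* Last election finished without a leader. */
    bool had_leader;           /* A leader was elected since the last election
                                * started by this server. */
};

/* The only place leader_sid becomes non-zero; candidate_retrying is cleared
 * right next to it. */
static void
raft_set_leader(struct raft *raft, const struct uuid *sid)
{
    raft->leader_sid = *sid;
    raft->had_leader = true;
    raft->candidate_retrying = false;
}

/* The only place candidate_retrying can become true; leader_sid is zeroed
 * right next to it. */
static void
raft_start_election(struct raft *raft)
{
    raft->leader_sid = UUID_ZERO;
    raft->candidate_retrying = !raft->had_leader;
    raft->had_leader = false;
    /* ... become candidate, send vote requests, reset election timer ... */
}

So if the struct is only ever touched through these two paths, candidate_retrying == true and a non-zero leader_sid should be mutually exclusive, which is exactly what the snapshot below contradicts.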
We have not seen it very often, probably once every half month or so. If it happens again, what information do you think we should collect that can help with further investigation?

Thanks,
Yun

From: Han Zhou <[email protected]>
Sent: Sunday, August 16, 2020 10:14 PM
To: Yun Zhou <[email protected]>
Cc: [email protected]; [email protected]; Girish Moodalbail <[email protected]>
Subject: Re: the raft_is_connected state of a raft server stays as false and cannot recover

On Thu, Aug 13, 2020 at 5:26 PM Yun Zhou <[email protected]> wrote:

Hi,

We need an expert's view to address a problem we are seeing now and then: an ovsdb-server node in a 3-node raft cluster keeps printing the "raft_is_connected: false" message, and its "connected" state in its _Server DB stays false. According to the ovsdb-server(5) manpage, this means the server is not in contact with a majority of its cluster.

Apart from its "connected" state, from what we can see this server is in the follower state and works fine, and the connections between it and the other two servers appear healthy as well.

Below is a snapshot of its raft structure at the time of the problem. Note that its candidate_retrying field stays true. Hopefully the provided information can help figure out what is going wrong here. Unfortunately, we don't have a reliable way to reproduce it:

Thanks for reporting the issue. This looks really strange. In the state below, leader_sid is non-zero, but candidate_retrying is true. According to the latest code, whenever leader_sid is set to non-zero (in raft_set_leader()), candidate_retrying is set to false; whenever candidate_retrying is set to true (in raft_start_election()), leader_sid is set to UUID_ZERO. And the data structure is initialized with xzalloc(), making sure candidate_retrying is false at the beginning. So, sorry, I can't explain how it ends up in this contradictory situation. It would be helpful if there were a way to reproduce it. How often does it happen?
Thanks,
Han

(gdb) print *(struct raft *)0xa872c0
$19 = {
  hmap_node = {hash = 2911123117, next = 0x0},
  log = 0xa83690,
  cid = {parts = {2699238234, 2258650653, 3035282424, 813064186}},
  sid = {parts = {1071328836, 400573240, 2626104521, 1746414343}},
  local_address = 0xa874e0 "tcp:10.8.51.55:6643",
  local_nickname = 0xa876d0 "3fdb",
  name = 0xa876b0 "OVN_Northbound",
  servers = {buckets = 0xad4bc0, one = 0x0, mask = 3, n = 3},
  election_timer = 1000,
  election_timer_new = 0,
  term = 3,
  vote = {parts = {1071328836, 400573240, 2626104521, 1746414343}},
  synced_term = 3,
  synced_vote = {parts = {1071328836, 400573240, 2626104521, 1746414343}},
  entries = 0xbf0fe0,
  log_start = 2,
  log_end = 312,
  log_synced = 311,
  allocated_log = 512,
  snap = {term = 1, data = 0xaafb10, eid = {parts = {1838862864, 1569866528, 2969429118, 3021055395}}, servers = 0xaafa70, election_timer = 1000},
  role = RAFT_FOLLOWER,
  commit_index = 311,
  last_applied = 311,
  leader_sid = {parts = {642765114, 43797788, 2533161504, 3088745929}},
  election_base = 6043283367,
  election_timeout = 6043284593,
  joining = false,
  remote_addresses = {map = {buckets = 0xa87410, one = 0xa879c0, mask = 0, n = 1}},
  join_timeout = 6037634820,
  leaving = false,
  left = false,
  leave_timeout = 0,
  failed = false,
  waiters = {prev = 0xa87448, next = 0xa87448},
  listener = 0xaafad0,
  listen_backoff = -9223372036854775808,
  conns = {prev = 0xbcd660, next = 0xaafc20},
  add_servers = {buckets = 0xa87480, one = 0x0, mask = 0, n = 0},
  remove_server = 0x0,
  commands = {buckets = 0xa874a8, one = 0x0, mask = 0, n = 0},
  ping_timeout = 6043283700,
  n_votes = 1,
  candidate_retrying = true,
  had_leader = false,
  ever_had_leader = true
}

Thanks - Yun
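For context on why candidate_retrying alone pins the state: the "connected" column in _Server and the "raft_is_connected: false" log line are both driven by a check roughly like the sketch below. This is again only a paraphrase of how we read raft_is_connected(), not the verbatim code, with the struct reduced to the flags involved. Note that leader_sid is not consulted there, so candidate_retrying stuck at true is enough to keep reporting "not connected" despite the healthy-looking, non-zero leader_sid in the dump above.

/* Paraphrased sketch (not verbatim ovsdb/raft.c) of the condition behind the
 * "connected" column in _Server and the "raft_is_connected" log message. */
#include <stdbool.h>

struct raft_status {
    bool candidate_retrying;             /* true in the dump above */
    bool joining, leaving, left, failed; /* all false in the dump above */
    bool ever_had_leader;                /* true in the dump above */
};

static bool
raft_is_connected_sketch(const struct raft_status *raft)
{
    /* leader_sid plays no part here, so a stale candidate_retrying == true is
     * sufficient to keep "connected" at false indefinitely. */
    return !raft->candidate_retrying
           && !raft->joining
           && !raft->leaving
           && !raft->left
           && !raft->failed
           && raft->ever_had_leader;
}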
