Han,

I just found out that we are using OVS built directly from the upstream 
branch-2.13. It seems this branch does not have the following commit:

# git log -p -1 cdae6100f8
commit cdae6100f89d04c5c29dc86a490b936a204622b7
Author: Han Zhou <[email protected]>
Date:   Thu Mar 5 23:48:46 2020 -0800

  raft: Unset leader when starting election.

From my reading of the code, the lack of this commit could cause a missing 
raft_set_leader() call, and therefore candidate_retrying could stay true.

Please let me know if my understanding is correct. If so, the problem should 
already be fixed in the upstream master branch.
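
To illustrate what I mean, below is a toy model (not the real ovsdb/raft.c, 
just a simplified sketch of my reading of it) showing how a stale leader_sid 
could leave candidate_retrying stuck at true. Only the function and field 
names mirror the real code; everything else is invented for the example.

/* Toy model of my reading of ovsdb/raft.c -- not the real code.  It only
 * keeps the two fields that matter here: leader_sid and candidate_retrying. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct toy_raft {
    char leader_sid[40];        /* "" plays the role of UUID_ZERO */
    bool candidate_retrying;
};

/* Mirrors raft_set_leader(): as far as I can see, the only place that
 * clears candidate_retrying. */
static void
set_leader(struct toy_raft *r, const char *sid)
{
    snprintf(r->leader_sid, sizeof r->leader_sid, "%s", sid);
    r->candidate_retrying = false;
}

/* Mirrors raft_start_election().  'patched' stands for having commit
 * cdae6100f8 ("raft: Unset leader when starting election.").  In the real
 * code candidate_retrying is set conditionally; always setting it here is a
 * simplification. */
static void
start_election(struct toy_raft *r, bool patched)
{
    if (patched) {
        r->leader_sid[0] = '\0';        /* unset the leader */
    }
    r->candidate_retrying = true;
}

/* Mirrors the path that handles messages from the leader: raft_set_leader()
 * is only called when the advertised leader differs from leader_sid. */
static void
hear_from_leader(struct toy_raft *r, const char *sid)
{
    if (strcmp(r->leader_sid, sid)) {
        set_leader(r, sid);
    }
}

int
main(void)
{
    for (int patched = 0; patched <= 1; patched++) {
        struct toy_raft r = { "", false };

        set_leader(&r, "264f");          /* a leader was elected earlier */
        start_election(&r, patched);     /* a brief election is triggered */
        hear_from_leader(&r, "264f");    /* the same leader wins again */

        printf("patched=%d  candidate_retrying=%s\n",
               patched, r.candidate_retrying ? "true" : "false");
    }
    return 0;
}

The unpatched case prints candidate_retrying=true, which is what I think we 
are hitting: leader_sid still points at the old leader, so the later 
raft_set_leader() call is skipped and candidate_retrying never goes back to 
false.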

Sorry for the confusion about the OVS version.

Thanks
Yun




From: Han Zhou <[email protected]>
Sent: Sunday, August 16, 2020 10:14 PM
To: Yun Zhou <[email protected]>
Cc: [email protected]; [email protected]; Girish 
Moodalbail <[email protected]>
Subject: Re: the raft_is_connected state of a raft server stays as false and 
cannot recover

On Thu, Aug 13, 2020 at 5:26 PM Yun Zhou <[email protected]> wrote:
Hi,

We need an expert's view to address a problem we are seeing now and then: an 
ovsdb-server node in a 3-node raft cluster keeps printing the 
"raft_is_connected: false" message, and the "connected" state in its _Server DB 
stays false.

According to the ovsdb-server(5) manpage, this means the server is not in 
contact with a majority of its cluster.

Except its "connected" state, from what we can see, this server is in the 
follower state and works fine, and connection between it and the other two 
servers appear healthy as well.

Below is a snapshot of its raft structure at the time of the problem. Note that 
its candidate_retrying field stays true.

Hopefully the provided information can help figure out what is going wrong 
here. Unfortunately, we don't have a reliable way to reproduce it:

Thanks for reporting the issue. This looks really strange. In the state below, 
leader_sid is non-zero, but candidate_retrying is true.
According to the latest code, whenever leader_sid is set to non-zero (in 
raft_set_leader()), candidate_retrying is set to false; whenever 
candidate_retrying is set to true (in raft_start_election()), leader_sid is 
set to UUID_ZERO. The data structure is also initialized with xzalloc, making 
sure candidate_retrying is false in the beginning. So, sorry, I can't explain 
how it ends up in this conflicting situation. It would be helpful if there were 
a way to reproduce it. How often does it happen?
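
For reference, the "connected" state reported in the _Server DB comes from 
raft_is_connected(). Below is a rough paraphrase of its condition (from my 
memory of ovsdb/raft.c, so the exact flag list may differ), evaluated with the 
values from the gdb dump below; candidate_retrying is the only flag keeping it 
false:

/* Rough paraphrase of the raft_is_connected() condition, evaluated with the
 * values from the gdb dump below.  The flag list here is from my reading of
 * ovsdb/raft.c and may be slightly off. */
#include <stdbool.h>
#include <stdio.h>

int
main(void)
{
    /* Values copied from the dump of struct raft at 0xa872c0. */
    bool candidate_retrying = true;
    bool joining = false;
    bool leaving = false;
    bool left = false;
    bool failed = false;
    bool ever_had_leader = true;

    bool connected = !candidate_retrying && !joining && !leaving
                     && !left && !failed && ever_had_leader;
    printf("raft_is_connected: %s\n", connected ? "true" : "false");
    return 0;   /* prints "false"; flip candidate_retrying and it is "true" */
}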

Thanks,
Han


(gdb) print *(struct raft *)0xa872c0
$19 = {
  hmap_node = {
    hash = 2911123117,
    next = 0x0
  },
  log = 0xa83690,
  cid = {
    parts = {2699238234, 2258650653, 3035282424, 813064186}
  },
  sid = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  local_address = 0xa874e0 "tcp:10.8.51.55:6643",
  local_nickname = 0xa876d0 "3fdb",
  name = 0xa876b0 "OVN_Northbound",
  servers = {
    buckets = 0xad4bc0,
    one = 0x0,
    mask = 3,
    n = 3
  },
  election_timer = 1000,
  election_timer_new = 0,
  term = 3,
  vote = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  synced_term = 3,
  synced_vote = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  entries = 0xbf0fe0,
  log_start = 2,
  log_end = 312,
  log_synced = 311,
  allocated_log = 512,
  snap = {
    term = 1,
    data = 0xaafb10,
    eid = {
      parts = {1838862864, 1569866528, 2969429118, 3021055395}
    },
    servers = 0xaafa70,
    election_timer = 1000
  },
  role = RAFT_FOLLOWER,
  commit_index = 311,
  last_applied = 311,
  leader_sid = {
    parts = {642765114, 43797788, 2533161504, 3088745929}
  },
  election_base = 6043283367,
  election_timeout = 6043284593,
  joining = false,
  remote_addresses = {
    map = {
      buckets = 0xa87410,
      one = 0xa879c0,
      mask = 0,
      n = 1
    }
  },
  join_timeout = 6037634820,
  leaving = false,
  left = false,
  leave_timeout = 0,
  failed = false,
  waiters = {
    prev = 0xa87448,
    next = 0xa87448
  },
  listener = 0xaafad0,
  listen_backoff = -9223372036854775808,
  conns = {
    prev = 0xbcd660,
    next = 0xaafc20
  },
  add_servers = {
    buckets = 0xa87480,
    one = 0x0,
    mask = 0,
    n = 0
  },
  remove_server = 0x0,
  commands = {
    buckets = 0xa874a8,
    one = 0x0,
    mask = 0,
    n = 0
  },
  ping_timeout = 6043283700,
  n_votes = 1,
  candidate_retrying = true,
  had_leader = false,
  ever_had_leader = true
}

Thanks
- Yun
