[
https://issues.apache.org/jira/browse/KUDU-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282541#comment-15282541
]
zhangsong commented on KUDU-1449:
---------------------------------
To follow up on my last comment:
According to the log, the timeline should be:
1 There are three replicas in the Raft config: 1 leader and 2 followers.
2 Follower 68d67ae4aaf44280977c6e65c7be3563 loses its connection to the leader
and starts a leader election with 670224208cc44a118fd96239f50db724, but its
vote request is denied.
3 Leader efc6b0312c4645b694f34d9d40f75ddf considers follower
68d67ae4aaf44280977c6e65c7be3563 dead and proposes a new Raft configuration.
4 Leader efc6b0312c4645b694f34d9d40f75ddf's configuration change gets committed.
5 The Raft config now has 2 replicas.
6 The leader loses its connection to everyone.
7 The master issues an addServer RPC, but it fails because the connection has
been torn down; the master retries many times.
8 Follower 670224208cc44a118fd96239f50db724 notices that the leader is down,
starts a leader election, and retries forever.
In step 8, because there are only 2 replicas in the Raft config, the follower
can never become leader.
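The reason step 8 can never succeed is simple majority arithmetic. Below is a minimal sketch, not Kudu code; the function names are hypothetical and only illustrate the Raft rule that a candidate needs votes from more than half of the voters in the current config:

```python
def majority_size(num_voters: int) -> int:
    """Smallest number of votes that is more than half of the voters."""
    return num_voters // 2 + 1


def can_win_election(num_voters: int, reachable_voters: int) -> bool:
    """reachable_voters counts the candidate itself plus every voter it can reach."""
    return reachable_voters >= majority_size(num_voters)


# 3-replica config, leader down: the two followers can still elect a leader.
print(can_win_election(num_voters=3, reachable_voters=2))  # True

# 2-replica config, leader down: the lone follower only has its own vote.
print(can_win_election(num_voters=2, reachable_voters=1))  # False
```

With 2 voters the majority is also 2, so a 2-replica config tolerates no failures at all.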
To solve it: (1) the master should take some extra action when it finds that
the leader has crashed, instead of issuing addServer forever; (2) the follower
should contact the master about the abnormal Raft configuration.
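A rough sketch of what proposal (2) could look like on the follower side. This is only an illustration under assumptions: the `ReplicaState` type, its fields, and the `max_failed_elections` threshold are hypothetical and do not correspond to any existing Kudu API.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    voters_in_config: int    # voters in the committed Raft config
    expected_replicas: int   # replication factor the tablet should have
    failed_elections: int    # consecutive elections this follower has lost


def should_report_to_master(state: ReplicaState,
                            max_failed_elections: int = 3) -> bool:
    """Report when the config is under-replicated and elections keep failing."""
    under_replicated = state.voters_in_config < state.expected_replicas
    stuck = state.failed_elections >= max_failed_elections
    return under_replicated and stuck


# The situation from step 8: 2 voters instead of 3, elections failing forever.
state = ReplicaState(voters_in_config=2, expected_replicas=3, failed_elections=5)
print(should_report_to_master(state))  # True
```

What the master would then do with such a report (the "extra" in proposal (1)) is left open here, as it is in the comment above.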
> tablet unavailable caused by follower can not upgrade to leader.
> -----------------------------------------------------------------
>
> Key: KUDU-1449
> URL: https://issues.apache.org/jira/browse/KUDU-1449
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.8.0
> Environment: jd.com production env
> Reporter: zhangsong
> Priority: Critical
>
> 1 background: 5 nodes crashed today due to system OOM. According to the Raft
> protocol, Kudu should select a follower, promote it to leader, and provide
> service again, but it did not.
> Found the following error when issuing a query via Impala: "Unable to open scanner:
> Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string
> memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32
> cate1_id=-2147483648, int32 chan_type=-2147483648, int32
> county_id=-2147483648, int32 city_id=-2147483648, int32
> province_id=-2147483648, 1) failed: timed out after deadline expired: timed
> out after deadline expired
> "
> 2 analysis:
> According to the bucket number, the target tablet has only two replicas,
> which is odd. Meanwhile, the tablet server hosting the leader replica
> has crashed.
> The follower cannot be promoted to leader in that situation: with only one
> leader and one follower, and the leader dead, the follower cannot get a
> majority of votes (only it votes for itself).
> As a result, the tablet is unavailable even though a follower hosting a
> replica is still alive.
> After restarting kudu-server on the node that hosted the previous leader
> replica, we observed that the former leader replica became a follower and the
> former follower replica became leader; another follower replica was created,
> and there is a 3-replica Raft configuration again.
> 3 modifications:
> The follower should notice the abnormal situation where there are only two
> replicas in the Raft configuration (one leader and one follower) and contact
> the master to correct it.
> 4 to do:
> What causes the two-replica Raft configuration is still unknown.