[ https://issues.apache.org/jira/browse/KUDU-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284230#comment-15284230 ]
zhangsong commented on KUDU-1449:
---------------------------------
Yes, steps 5 and 6 are almost simultaneous.
About KUDU-1097: if it were implemented, the timeline would change to:
1 There are three replicas in the Raft config: 1 leader and 2 followers.
2 Follower 68d67ae4aaf44280977c6e65c7be3563 loses its connection to the leader.
3 The leader notifies the master of the issue.
4 The master chooses a new node and sets its status to PRE_VOTER; once it has
caught up, it is set to VOTER.
5 When the added node is ready, follower 68d67ae4aaf44280977c6e65c7be3563 is
removed.
If the leader crashes at any point during this process, the newly added node
will never catch up and will be abandoned, but the Raft config still has 3
replicas with 2 followers alive, so one of the followers can be promoted to
leader and the tablet can serve writes after that.
If my understanding is right, this issue can be marked as a duplicate of
KUDU-1097.
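
The majority arithmetic behind the two timelines can be sketched as follows.
This is only an illustration, not Kudu code: the function names and replica
labels are hypothetical, and it assumes that a PRE_VOTER replica does not count
toward the majority and that the disconnected follower is still alive and
reachable by the other follower.

```python
def majority(num_voters: int) -> int:
    """Smallest number of votes that is more than half of the voters."""
    return num_voters // 2 + 1


def can_elect_leader(voters: set, alive: set) -> bool:
    """A new leader can be elected only if a majority of voters are alive.

    Non-voting replicas (e.g. a PRE_VOTER) are not in `voters`, so they are
    ignored even if they are alive.
    """
    return len(voters & alive) >= majority(len(voters))


# Situation reported in this issue: the config has shrunk to 2 voters and
# the leader's tablet server has crashed. The majority of 2 is 2, so the
# single surviving follower cannot win an election.
evict_first = can_elect_leader(
    voters={"leader", "follower_b"},
    alive={"follower_b"},
)

# Timeline with KUDU-1097: the disconnected follower stays a voter while
# the new node is only a PRE_VOTER. If the leader crashes, 2 of the 3
# voters are still alive, which meets the majority of 2.
add_first = can_elect_leader(
    voters={"leader", "follower_a", "follower_b"},
    alive={"follower_a", "follower_b", "new_node"},
)

print(evict_first, add_first)
```

Under these assumptions the first case cannot elect a leader and the second
can, which matches the behavior described in the issue and in the timeline
above.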
> tablet unavailable caused by follower can not upgrade to leader.
> -----------------------------------------------------------------
>
> Key: KUDU-1449
> URL: https://issues.apache.org/jira/browse/KUDU-1449
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.8.0
> Environment: jd.com production env
> Reporter: zhangsong
> Priority: Critical
>
> 1 background: 5 nodes crashed today due to system OOM. According to the Raft
> protocol, Kudu should select a follower, promote it to leader, and provide
> service again, but it did not.
> The following error was found when issuing a query via Impala: "Unable to open scanner:
> Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string
> memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32
> cate1_id=-2147483648, int32 chan_type=-2147483648, int32
> county_id=-2147483648, int32 city_id=-2147483648, int32
> province_id=-2147483648, 1) failed: timed out after deadline expired: timed
> out after deadline expired
> "
> 2 analysis:
> According to the bucket number, the target tablet has only two
> replicas, which is odd. Meanwhile, the tablet server hosting the leader replica
> has crashed.
> The follower cannot be promoted to leader in that situation: with only one
> leader and one follower, once the leader is dead the follower cannot get a
> majority of votes for its promotion to leader (as only it votes for itself).
> This results in the tablet being unavailable even though there is a follower
> left hosting a replica.
> After restarting kudu-server on the node that hosted the previous leader
> replica, we observed that the leader replica became a follower and the previous
> follower replica became leader; another follower replica was created and there
> is a 3-replica Raft configuration again.
> 3 modifications:
> A follower should notice the abnormal situation where there are only two
> replicas in the Raft configuration (one leader and one follower) and contact
> the master to correct it.
> 4 to do:
> What caused the two-replica Raft configuration is still unknown.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)