[jira] [Commented] (KUDU-1449) tablet unavailable caused by follower can not upgrade to leader.

zhangsong (JIRA) Mon, 16 May 2016 00:34:43 -0700

    [ 
https://issues.apache.org/jira/browse/KUDU-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284230#comment-15284230
 ]


zhangsong commented on KUDU-1449:
---------------------------------

Yes, steps 5 and 6 are almost simultaneous.
Regarding KUDU-1097: if it is implemented, the timeline would change to:
1 There are three replicas in the raft config: 1 leader and 2 followers.
2 Follower 68d67ae4aaf44280977c6e65c7be3563 loses its connection with the leader.
3 The leader notifies the master of the issue.
4 The master chooses a new node and sets its status to PRE_VOTER; once it has 
caught up, it is set to VOTER.
5 When the added node is ready, follower 68d67ae4aaf44280977c6e65c7be3563 is removed.
During the whole process, if the leader crashes, the newly added node will never 
catch up and will be abandoned. The raft config still has 3 replicas with 2 
followers alive, so one of the followers can be promoted to leader and the 
tablet can serve writes after that.
If my understanding is right, this issue can be marked as a duplicate of KUDU-1097.
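
To make the difference concrete, here is a minimal sketch of the majority arithmetic behind the two timelines. It is only an illustration, not Kudu code: the replica names A, B, C and the helper functions are hypothetical, B stands for the follower that lost its connection, and it assumes that B is still running and reachable by the other follower and that a PRE_VOTER replica does not count toward the majority.

```python
def majority(num_voters: int) -> int:
    """Smallest number of votes that is a strict majority of the voters."""
    return num_voters // 2 + 1


def can_elect_leader(voters: set, alive: set) -> bool:
    """An election can succeed only if the live voters form a majority."""
    return len(voters & alive) >= majority(len(voters))


# Hypothetical replica names: A is the leader, B is the follower that lost
# its connection with the leader, C is the healthy follower.

# Remove-first: B has already been dropped, so the config has two voters.
remove_first_voters = {"A", "C"}
# Add-first (KUDU-1097 as described above): B stays a voter until the new
# replica is promoted; the PRE_VOTER replica is assumed not to vote.
add_first_voters = {"A", "B", "C"}

# Leader A crashes in both cases.
print(can_elect_leader(remove_first_voters, alive={"C"}))    # False: 1 of 2 votes
print(can_elect_leader(add_first_voters, alive={"B", "C"}))  # True: 2 of 3 votes
```

With two voters the majority is 2, so a single surviving follower can never win an election; with three voters the majority is still 2, so the two surviving followers can elect one of themselves.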

> tablet unavailable caused by  follower can not upgrade to leader.
> -----------------------------------------------------------------
>
>                 Key: KUDU-1449
>                 URL: https://issues.apache.org/jira/browse/KUDU-1449
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.8.0
>         Environment: jd.com production env
>            Reporter: zhangsong
>            Priority: Critical
>
> 1 Background: 5 nodes crashed today due to system OOM. According to the raft 
> protocol, Kudu should select a follower, promote it to leader, and provide 
> service again, but it did not.  
> The following error was found when issuing a query via Impala: "Unable to open scanner: 
> Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string 
> memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32 
> cate1_id=-2147483648, int32 chan_type=-2147483648, int32 
> county_id=-2147483648, int32 city_id=-2147483648, int32 
> province_id=-2147483648, 1) failed: timed out after deadline expired: timed 
> out after deadline expired
> "  
> 2 Analysis:
> According to the bucket number, the target tablet has only two 
> replicas, which is odd. Meanwhile, the tablet server hosting the leader replica 
> has crashed. 
> The follower cannot be promoted to leader in that situation: with only one leader 
> and one follower, once the leader is dead the follower cannot get a majority of 
> votes for its promotion to leader (only the follower itself votes for itself).
> This results in the tablet being unavailable even though there is a follower left 
> hosting the replica.
> After restarting kudu-server on the node that hosted the previous leader 
> replica, we observed that the leader replica became a follower and the previous 
> follower replica became leader; another follower replica was created, and there 
> is a 3-replica raft configuration again.
> 3 Modifications:
> A follower should notice the abnormal situation in which there are only two 
> replicas in the raft configuration (one leader and one follower) and contact the 
> master to correct it.
> 4 To do:
> What causes the two-replica raft configuration is still unknown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
