[jira] [Commented] (KUDU-1449) tablet unavailable caused by follower can not upgrade to leader.

Todd Lipcon (JIRA) Fri, 13 May 2016 09:01:53 -0700

    [ 
https://issues.apache.org/jira/browse/KUDU-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282832#comment-15282832
 ]


Todd Lipcon commented on KUDU-1449:
-----------------------------------

How much time elapsed between step 5 and step 6 above? It seems like what 
you're talking about here is a case where you had two "simultaneous" failures 
of a tablet with 3 replicas. ("simultaneous" within the 5-10 minutes it would 
take to re-replicate). So, I don't think there is any other correct thing that 
the master could do in step 7 or 8 -- the master can only issue a changeconfig 
on a leader, and there is only 1/2 replicas online.

Perhaps the best improvement we could make here is implementing KUDU-1097, 
whicch would change the order such that we would not have evicted the replica 
so eagerly in step 3 (not until there is a new fourth replica available).

> tablet unavailable caused by  follower can not upgrade to leader.
> -----------------------------------------------------------------
>
>                 Key: KUDU-1449
>                 URL: https://issues.apache.org/jira/browse/KUDU-1449
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.8.0
>         Environment: jd.com production env
>            Reporter: zhangsong
>            Priority: Critical
>
> 1 background : there is 5 node crash due to sys oom today , according to raft 
> protocol, kudu should select follower and upgrade it to leader and provide 
> service again,while it did not.  
> Found such error when issuing query via impala: "Unable to open scanner: 
> Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string 
> memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32 
> cate1_id=-2147483648, int32 chan_type=-2147483648, int32 
> county_id=-2147483648, int32 city_id=-2147483648, int32 
> province_id=-2147483648, 1) failed: timed out after deadline expired: timed 
> out after deadline expired
> "  
> 2 analysis:
> According to the bucket# , found the target tablet only has two 
> replicas,which is odd. Meantime the tablet-server hosting the leader replica 
> has crashed. 
> The follower can not upgrade to leader in that situation: only one leader and 
> one follower ,leader dead, follower can not get majority of votes for its 
> upgrading to leader(as only itself votes for itself).
> Thus result in the unavailability of tablet while there is a follower left 
> hosting the replica.
> After restart kudu-server on the node which hosting the previous leader 
> replica,  Observed that the leader replica become follower and previous 
> follower replica become leader, another follower replica is created and there 
> is 3-replica raft-configuration again.
> 3 modifications:
> follower should notice the abnormal situation where there is only two replica 
> in raft-configuration: one leader and one follower, and contact master to 
> correct it.
> 4 to do:
> what cause the two-replica raft-configuration is still known.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KUDU-1449) tablet unavailable caused by follower can not upgrade to leader.

Reply via email to