[ 
https://issues.apache.org/jira/browse/KUDU-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282541#comment-15282541
 ] 

zhangsong commented on KUDU-1449:
---------------------------------

To follow up on the last comment:

According to the log, the timeline should be:
1 There are three replicas in the raft config: 1 leader and 2 followers.
2 Follower 68d67ae4aaf44280977c6e65c7be3563 loses its connection to the leader 
and starts a leader election with 670224208cc44a118fd96239f50db724, but its 
vote request is denied.
3 Leader efc6b0312c4645b694f34d9d40f75ddf thinks follower 
68d67ae4aaf44280977c6e65c7be3563 is dead and proposes a new raft configuration.
4 The consensus round of leader efc6b0312c4645b694f34d9d40f75ddf for that 
change gets committed.
5 The raft config now has 2 replicas.
6 The leader loses its connection to everyone.
7 The master issues an addServer RPC, but it fails because the connection has 
been torn down; it retries many times.
8 Follower 670224208cc44a118fd96239f50db724 notices the leader is down and 
starts a leader election, retrying forever.
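
The vote counting behind step 8 can be sketched as follows. This is only an 
illustrative model of Raft majority voting, not Kudu's actual code; the helper 
names majority_size and can_win_election are made up for this example, and 
"reachable" simply means the replica is alive and able to grant a vote.

```python
from typing import Set

# Hypothetical model of Raft majority voting; not Kudu's implementation.


def majority_size(num_voters: int) -> int:
    """Smallest number of votes that is a strict majority of num_voters."""
    return num_voters // 2 + 1


def can_win_election(config: Set[str], reachable: Set[str]) -> bool:
    """A candidate can win only if a majority of the config's voters can vote."""
    votes = len(config & reachable)
    return votes >= majority_size(len(config))


LEADER = "efc6b0312c4645b694f34d9d40f75ddf"
FOLLOWER_A = "68d67ae4aaf44280977c6e65c7be3563"
FOLLOWER_B = "670224208cc44a118fd96239f50db724"

# Step 8: the config has shrunk to two replicas and the leader is unreachable,
# so the remaining follower can only collect its own vote (1 of the 2 needed).
two_replica_config = {LEADER, FOLLOWER_B}
stuck = can_win_election(two_replica_config, reachable={FOLLOWER_B})

# For comparison: the original three-replica config with only the leader down.
# Two followers together still form a majority (2 of 3).
three_replica_config = {LEADER, FOLLOWER_A, FOLLOWER_B}
recoverable = can_win_election(
    three_replica_config, reachable={FOLLOWER_A, FOLLOWER_B}
)
```

Under these assumptions, stuck is False and recoverable is True: the 
two-replica config tolerates no failures at all.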

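In step 8, since there are only 2 replicas in the raft config, the follower 
needs both votes to win an election but can only get its own, so it can never 
become leader.
To solve it: 1 the master should take some extra action when it finds that the 
leader has crashed, instead of issuing addServer forever; 2 the follower 
should contact the master about the abnormal raft configuration.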
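
As a rough sketch of proposal 2, a follower could run a check like the one 
below before deciding to report to the master. Everything here is 
hypothetical: ReplicaView, should_report_to_master, and the 
max_failed_elections threshold are invented for illustration and do not 
correspond to any existing Kudu API.

```python
from dataclasses import dataclass

# Hypothetical follower-side check; all names and thresholds are assumptions.


@dataclass
class ReplicaView:
    config_size: int         # replicas in the current raft config
    target_replication: int  # replication factor the tablet should have
    leader_reachable: bool   # whether this follower still hears the leader
    failed_elections: int    # consecutive elections this follower has lost


def should_report_to_master(view: ReplicaView,
                            max_failed_elections: int = 3) -> bool:
    """Return True when the config looks under-replicated and stuck."""
    under_replicated = view.config_size < view.target_replication
    cannot_elect = (not view.leader_reachable
                    and view.failed_elections >= max_failed_elections)
    return under_replicated and cannot_elect


# The situation in step 8: 2 of 3 replicas, leader gone, elections failing.
stuck_view = ReplicaView(config_size=2, target_replication=3,
                         leader_reachable=False, failed_elections=5)
needs_report = should_report_to_master(stuck_view)
```

What the master should do with such a report is left open here; the sketch 
only covers how a follower might detect the condition.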

> tablet unavailable caused by  follower can not upgrade to leader.
> -----------------------------------------------------------------
>
>                 Key: KUDU-1449
>                 URL: https://issues.apache.org/jira/browse/KUDU-1449
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.8.0
>         Environment: jd.com production env
>            Reporter: zhangsong
>            Priority: Critical
>
> 1 background: 5 nodes crashed due to system OOM today. According to the raft 
> protocol, kudu should select a follower, promote it to leader, and provide 
> service again, but it did not.  
> Found the following error when issuing a query via impala: "Unable to open scanner: 
> Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string 
> memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32 
> cate1_id=-2147483648, int32 chan_type=-2147483648, int32 
> county_id=-2147483648, int32 city_id=-2147483648, int32 
> province_id=-2147483648, 1) failed: timed out after deadline expired: timed 
> out after deadline expired
> "  
> 2 analysis:
> According to the bucket #, the target tablet has only two replicas, which 
> is odd. Meanwhile, the tablet server hosting the leader replica has 
> crashed. 
> The follower cannot be promoted to leader in that situation: with only one 
> leader and one follower, and the leader dead, the follower cannot get a 
> majority of votes (only it votes for itself).
> This makes the tablet unavailable even though a follower hosting the 
> replica is still alive.
> After restarting kudu-server on the node that hosted the previous leader 
> replica, I observed that the leader replica became a follower, the previous 
> follower replica became leader, another follower replica was created, and 
> there was a 3-replica raft configuration again.
> 3 modifications:
> The follower should notice the abnormal situation where there are only two 
> replicas in the raft configuration (one leader and one follower) and 
> contact the master to correct it.
> 4 to do:
> What causes the two-replica raft configuration is still unknown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
