zhangsong created KUDU-1449:
-------------------------------

             Summary: tablet unavailable caused by  follower can not upgrade to 
leader.
                 Key: KUDU-1449
                 URL: https://issues.apache.org/jira/browse/KUDU-1449
             Project: Kudu
          Issue Type: Bug
         Environment: jd.com production env
            Reporter: zhangsong
            Priority: Critical


1 background: 5 nodes crashed today due to system OOM. According to the Raft 
protocol, Kudu should elect a follower as the new leader and resume service, 
but it did not.  
The following error appeared when issuing a query via Impala: "Unable to open scanner: Timed 
out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string memberid=, 
int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32 
cate1_id=-2147483648, int32 chan_type=-2147483648, int32 county_id=-2147483648, 
int32 city_id=-2147483648, int32 province_id=-2147483648, 1) failed: timed out 
after deadline expired: timed out after deadline expired"

2 analysis:
Based on the bucket number, the target tablet had only two replicas, which is 
unexpected. Meanwhile, the tablet server hosting the leader replica had crashed. 
With only one leader and one follower in the configuration and the leader dead, 
the follower cannot be elected leader: it cannot obtain a majority of votes, 
because the only vote it receives is its own.
As a result, the tablet is unavailable even though one follower replica is 
still alive.
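
The vote arithmetic behind this can be sketched as follows. This is only an 
illustration of the Raft majority rule, not Kudu code; the function names 
majority and can_elect_leader are made up for this example.

```python
def majority(num_voters: int) -> int:
    """Smallest number of votes that is strictly more than half of the voters."""
    return num_voters // 2 + 1


def can_elect_leader(num_voters: int, num_live: int) -> bool:
    """A candidate can win only if the live voters alone reach a majority."""
    return num_live >= majority(num_voters)


# Two-replica configuration, leader dead: 1 live voter, but 2 votes are needed.
print(can_elect_leader(num_voters=2, num_live=1))
# Three-replica configuration, one replica dead: 2 live voters, 2 votes needed.
print(can_elect_leader(num_voters=3, num_live=2))
```

With two voters the majority is 2, so a single surviving follower can never 
win an election, while a three-voter configuration tolerates one failure.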

After restarting kudu-server on the node that hosted the previous leader 
replica, I observed that the previous leader replica became a follower and the 
previous follower replica became the leader. Another follower replica was 
created, restoring a 3-replica Raft configuration.

3 modifications:
A follower should detect the abnormal situation in which the Raft configuration 
has only two replicas (one leader and one follower) and contact the master to 
correct it.
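
A rough sketch of the proposed check, under stated assumptions: RaftConfig, 
needs_repair, on_config_observed and the report_to_master callback are all 
hypothetical names for illustration and do not correspond to Kudu's actual 
classes or RPCs. The target of 3 replicas is taken from the configuration 
observed after the restart.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class RaftConfig:
    """Hypothetical view of a tablet's Raft configuration."""
    tablet_id: str
    voters: Tuple[str, ...]  # identifiers of the voting replicas


def needs_repair(config: RaftConfig, target_replicas: int = 3) -> bool:
    """True when the configuration has fewer voters than the target."""
    return len(config.voters) < target_replicas


def on_config_observed(config: RaftConfig,
                       report_to_master: Callable[[RaftConfig], None],
                       target_replicas: int = 3) -> bool:
    """Report an under-replicated configuration; return whether it was reported."""
    if needs_repair(config, target_replicas):
        report_to_master(config)  # stand-in for whatever call reaches the master
        return True
    return False


# Usage: collect reports in a list instead of contacting a real master.
reports = []
two_replicas = RaftConfig("tablet-453", ("ts-a", "ts-b"))
three_replicas = RaftConfig("tablet-454", ("ts-a", "ts-b", "ts-c"))
on_config_observed(two_replicas, reports.append)
on_config_observed(three_replicas, reports.append)
print([c.tablet_id for c in reports])
```

Only the two-replica configuration is reported. Such a check would presumably 
have to fire while the leader is still alive, since the configuration can no 
longer be changed once the tablet has lost its majority.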

4 to do:
What caused the two-replica Raft configuration is still unknown.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
