zhangsong created KUDU-1449:
-------------------------------
Summary: Tablet unavailable because the follower cannot be promoted to
leader.
Key: KUDU-1449
URL: https://issues.apache.org/jira/browse/KUDU-1449
Project: Kudu
Issue Type: Bug
Environment: jd.com production env
Reporter: zhangsong
Priority: Critical
1 background: 5 nodes crashed today due to system OOM. According to the Raft
protocol, Kudu should elect a follower as the new leader and resume service,
but it did not.
The following error appeared when issuing a query via Impala:
"Unable to open scanner: Timed
out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string memberid=,
int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32
cate1_id=-2147483648, int32 chan_type=-2147483648, int32 county_id=-2147483648,
int32 city_id=-2147483648, int32 province_id=-2147483648, 1) failed: timed out
after deadline expired: timed out after deadline expired
"
2 analysis:
Based on the bucket number, the target tablet turned out to have only two
replicas, which is odd. Meanwhile, the tablet server hosting the leader replica
had crashed.
In that situation the follower cannot become leader: the configuration has only
one leader and one follower, so once the leader is dead the follower cannot
obtain a majority of votes (it receives only its own vote).
The tablet is therefore unavailable even though a follower hosting a replica
is still alive.
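
The majority arithmetic behind this can be sketched as follows. This is a
minimal illustration of the Raft quorum rule, not Kudu code; the function
names are hypothetical.

```python
def majority(num_voters: int) -> int:
    """Smallest vote count that is a strict majority of num_voters."""
    return num_voters // 2 + 1


def can_elect_leader(num_voters: int, num_live_voters: int) -> bool:
    """An election can succeed only if the live voters can form a majority."""
    return num_live_voters >= majority(num_voters)


# Two-replica configuration from this report: the leader's server crashed,
# so only the follower is alive and it needs 2 votes but has 1.
print(can_elect_leader(num_voters=2, num_live_voters=1))

# Usual three-replica configuration with one crashed server: the two
# remaining replicas still form a majority.
print(can_elect_leader(num_voters=3, num_live_voters=2))
```

With two voters the majority is 2, so a two-replica configuration tolerates no
failures at all, whereas a three-replica configuration tolerates one.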
After restarting kudu-server on the node that hosted the previous leader
replica, the previous leader replica became a follower and the previous
follower replica became leader. Another follower replica was then created,
restoring a 3-replica Raft configuration.
3 modifications:
A follower should detect the abnormal situation in which the Raft
configuration has only two replicas (one leader and one follower) and contact
the master to correct it.
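
A rough sketch of what such a check might look like is given here. It is
purely illustrative: RaftConfig, needs_master_repair and the
expected_replicas parameter are hypothetical names, not part of Kudu, and the
default of 3 is assumed from the 3-replica configuration mentioned in the
analysis.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RaftConfig:
    """Hypothetical view of a tablet's Raft configuration."""
    voter_uuids: Tuple[str, ...]


def needs_master_repair(config: RaftConfig, expected_replicas: int = 3) -> bool:
    """Return True when the configuration has fewer voters than expected.

    A replica seeing this could report the tablet to the master so that a
    replacement replica is added before another failure occurs.
    """
    return len(set(config.voter_uuids)) < expected_replicas


# The situation in this report: one leader and one follower.
two_replica = RaftConfig(voter_uuids=("leader-ts", "follower-ts"))
print(needs_master_repair(two_replica))
```

How the report would actually reach the master, and who acts on it, is left
open here; the sketch only covers the detection step.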
4 to do:
What caused the two-replica Raft configuration is still unknown.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)