[ 
https://issues.apache.org/jira/browse/KUDU-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282530#comment-15282530
 ] 

zhangsong commented on KUDU-1449:
---------------------------------

Below are some log fragments from the leader, the follower, and the master.
First, the leader's log:

1 The leader found that one of its followers had been unreachable for more than 300 seconds:

I0511 11:13:35.938930  4055 raft_consensus.cc:650] T 
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf: Attempting 
to remove follower 68d67ae4aaf44280977c6e65c7be3563 from the Raft config. 
Reason: Leader has been unable to successfully communicate with Peer 
68d67ae4aaf44280977c6e65c7be3563 for more than 300 seconds (300.434s)
...
W0511 11:13:36.939457  6638 consensus_peers.cc:326] T 
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf -> Peer 
68d67ae4aaf44280977c6e65c7be3563 (follower_ip:7052): Couldn't send request to 
peer 68d67ae4aaf44280977c6e65c7be3563 for tablet 
1a6607ffb4d343cab71e2a1f33a18b24. Status: Timed out: UpdateConsensus RPC to 
follower_ip:7052 timed out after 1.000s. Retrying in the next heartbeat period. 
Already tried 201 times.
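The eviction decision in the log above can be sketched as follows. This is a simplified illustration, not Kudu's actual implementation; the 300-second threshold is taken from the "more than 300 seconds (300.434s)" message in the log.

```python
FOLLOWER_FAILURE_TIMEOUT_SEC = 300  # matches the 300s window in the log above

def should_evict(last_successful_contact: float, now: float) -> bool:
    """Leader-side check (simplified): evict a follower that has been
    unreachable for longer than the failure timeout."""
    return (now - last_successful_contact) > FOLLOWER_FAILURE_TIMEOUT_SEC

# The leader last reached the follower 300.434s ago, as in the log:
assert should_evict(last_successful_contact=0.0, now=300.434)
# A follower unreachable for less than the timeout is kept:
assert not should_evict(last_successful_contact=0.0, now=299.0)
```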

2 The leader proposed a config change to remove the follower, and the change was committed:
I0511 11:13:36.940100  4055 consensus_queue.cc:145] T 
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf [LEADER]: 
Queue going to LEADER mode. State: All replicated op: 0.0, Majority replicated 
op: 52.1877140, Committed index: 52.1877140, Last appended: 52.1877140, Current 
term: 52, Majority size: 2, State: 1, Mode: LEADER, active raft config: local: 
false peers { permanent_uuid: "efc6b0312c4645b694f34d9d40f75ddf" member_type: 
VOTER last_known_addr { host: "leader_ip" port: 7052 } } peers { 
permanent_uuid: "670224208cc44a118fd96239f50db724" member_type: VOTER 
last_known_addr { host: "another_follower" port: 7052 } }
I0511 11:13:36.941148  4055 consensus_queue.cc:572] T 
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf [LEADER]: 
Connected to new peer: Peer: 670224208cc44a118fd96239f50db724, Is new: false, 
Last received: 52.1877140, Next index: 1877141, Last known committed idx: 
1877140, Last exchange result: ERROR, Needs remote bootstrap: false
I0511 11:13:36.942076 11206 raft_consensus_state.cc:605] T 
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf [term 52 
LEADER]: Committing config change with OpId 52.1877141. Old config: { 
opid_index: -1 local: false peers { permanent_uuid: 
"68d67ae4aaf44280977c6e65c7be3563" member_type: VOTER last_known_addr { host: 
"follower_ip" port: 7052 } } peers { permanent_uuid: 
"efc6b0312c4645b694f34d9d40f75ddf" member_type: VOTER last_known_addr { host: 
"leader_ip" port: 7052 } } peers { permanent_uuid: 
"670224208cc44a118fd96239f50db724" member_type: VOTER last_known_addr { host: 
"another_follower" port: 7052 } } }. New config: { opid_index: 1877141 local: 
false peers { permanent_uuid: "efc6b0312c4645b694f34d9d40f75ddf" member_type: 
VOTER last_known_addr { host: "leader_ip" port: 7052 } } peers { 
permanent_uuid: "670224208cc44a118fd96239f50db724" member_type: VOTER 
last_known_addr { host: "another_follower" port: 7052 } } }
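The "Majority size: 2" in the queue state above follows directly from the Raft majority rule. A minimal sketch of that arithmetic, which also shows why the committed two-voter config is fragile:

```python
def majority_size(num_voters: int) -> int:
    """Raft majority: strictly more than half of the voters."""
    return num_voters // 2 + 1

# Old config: 3 voters -> majority of 2 (matches "Majority size: 2" above).
assert majority_size(3) == 2
# New config after the removal: 2 voters -> majority is still 2, so the
# loss of either remaining replica leaves no electable majority.
assert majority_size(2) == 2
```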

Log from the another_follower_ip node:
1 another_follower_ip first refused the leader's update, then committed the config change above:
I0511 11:13:36.941264 27518 raft_consensus.cc:831] T 
1a6607ffb4d343cab71e2a1f33a18b24 P 670224208cc44a118fd96239f50db724 [term 52 
FOLLOWER]: Refusing update from remote peer efc6b0312c4645b694f34d9d40f75ddf: 
Log matching property violated. Preceding OpId in replica: term: 52 
index: 1877140. Preceding OpId from leader: term: 52 index: 1877141. (index 
mismatch)
I0511 11:13:36.943799 27530 raft_consensus_state.cc:605] T 
1a6607ffb4d343cab71e2a1f33a18b24 P 670224208cc44a118fd96239f50db724 [term 52 
FOLLOWER]: Committing config change with OpId 52.1877141. Old config: { 
opid_index: -1 local: false peers { permanent_uuid: 
"68d67ae4aaf44280977c6e65c7be3563" member_type: VOTER last_known_addr { host: 
"follower_ip" port: 7052 } } peers { permanent_uuid: 
"efc6b0312c4645b694f34d9d40f75ddf" member_type: VOTER last_known_addr { host: 
"leader_ip" port: 7052 } } peers { permanent_uuid: 
"670224208cc44a118fd96239f50db724" member_type: VOTER last_known_addr { host: 
"another_follower_ip" port: 7052 } } }. New config: { opid_index: 1877141 
local: false peers { permanent_uuid: "efc6b0312c4645b694f34d9d40f75ddf" 
member_type: VOTER last_known_addr { host: "leader_ip" port: 7052 } } peers { 
permanent_uuid: "670224208cc44a118fd96239f50db724" member_type: VOTER 
last_known_addr { host: "another_follower_ip" port: 7052 } } }
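The "Log matching property violated" line is the standard Raft AppendEntries consistency check: the follower only accepts an update whose preceding OpId matches the last OpId in its own log. A simplified sketch of that check, using the OpIds from the log:

```python
from typing import NamedTuple

class OpId(NamedTuple):
    term: int
    index: int

def accepts_update(replica_last: OpId, leader_preceding: OpId) -> bool:
    """Follower-side consistency check (simplified): accept the update only
    if the leader's preceding OpId matches the replica's last OpId."""
    return replica_last == leader_preceding

# As in the log: the replica is at 52.1877140 but the leader's first request
# assumes a preceding OpId of 52.1877141, so the update is refused ...
assert not accepts_update(OpId(52, 1877140), OpId(52, 1877141))
# ... and a retry from the matching preceding OpId is accepted.
assert accepts_update(OpId(52, 1877140), OpId(52, 1877140))
```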

2 Also note that there are some vote-denial logs:
I0511 11:13:10.256023 27504 raft_consensus.cc:1612] T 
1a6607ffb4d343cab71e2a1f33a18b24 P 670224208cc44a118fd96239f50db724 [term 52 
FOLLOWER]: Leader election vote request: Denying vote to candidate 
68d67ae4aaf44280977c6e65c7be3563 for term 65 because replica is either leader 
or believes a valid leader to be alive.
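The denial above reflects the common Raft "sticky leader" refinement: a replica withholds its vote, even for a higher-term candidate, while it believes a valid leader is alive. A minimal sketch, with the terms taken from the log (the assumption that Kudu's check reduces to exactly this predicate is mine):

```python
def grant_vote(believes_leader_alive: bool,
               candidate_term: int, current_term: int) -> bool:
    """Simplified vote check: deny the vote while a valid leader is
    believed alive; otherwise grant it only for a higher term."""
    if believes_leader_alive:
        return False
    return candidate_term > current_term

# The evicted follower requests a vote for term 65, but this replica in
# term 52 still believes its leader is alive, so it denies (as in the log).
assert not grant_vote(believes_leader_alive=True, candidate_term=65, current_term=52)
# Without a live leader, the higher-term candidate would get the vote.
assert grant_vote(believes_leader_alive=False, candidate_term=65, current_term=52)
```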

Log from the master:
W0511 11:13:37.052328 33208 catalog_manager.cc:1891] TS 
efc6b0312c4645b694f34d9d40f75ddf: AddServer ChangeConfig RPC failed for tablet 
1a6607ffb4d343cab71e2a1f33a18b24: Network error: Client connection negotiation 
failed: client connection to 172.22.99.11:7052: connect: Connection timed out 
(error 110)
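Taken together, the logs show how the unavailability in the issue title arises: once the two-voter config is committed, the death of the leader leaves a lone follower that can never assemble a majority. A small sketch of that quorum arithmetic:

```python
def can_elect_leader(live_voters: int, total_voters: int) -> bool:
    """A leader can be elected only if the live voters form a majority
    of the Raft config (simplified illustration)."""
    return live_voters >= total_voters // 2 + 1

# With the committed 2-voter config: if the leader then dies, the lone
# surviving follower (1 of 2) cannot win an election ...
assert not can_elect_leader(live_voters=1, total_voters=2)
# ... whereas in a healthy 3-voter config, 2 survivors still suffice.
assert can_elect_leader(live_voters=2, total_voters=3)
```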

> tablet unavailable caused by  follower can not upgrade to leader.
> -----------------------------------------------------------------
>
>                 Key: KUDU-1449
>                 URL: https://issues.apache.org/jira/browse/KUDU-1449
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.8.0
>         Environment: jd.com production env
>            Reporter: zhangsong
>            Priority: Critical
>
> 1 Background: five nodes crashed today due to system OOM. According to the 
> Raft protocol, Kudu should have elected a follower as the new leader and 
> resumed service, but it did not.
> The following error appeared when issuing a query via Impala: "Unable to open scanner: 
> Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string 
> memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32 
> cate1_id=-2147483648, int32 chan_type=-2147483648, int32 
> county_id=-2147483648, int32 city_id=-2147483648, int32 
> province_id=-2147483648, 1) failed: timed out after deadline expired: timed 
> out after deadline expired
> "  
> 2 Analysis:
> Based on the bucket number, I found that the target tablet has only two 
> replicas, which is odd. Meanwhile, the tablet server hosting the leader 
> replica had crashed.
> The follower cannot be promoted to leader in this situation: with only one 
> leader and one follower, once the leader dies the follower cannot obtain a 
> majority of votes (only itself votes for itself).
> This results in the tablet being unavailable even though a follower replica 
> remains.
> After restarting kudu-server on the node hosting the previous leader 
> replica, I observed that the previous leader became a follower, the previous 
> follower became the leader, another follower replica was created, and the 
> 3-replica Raft configuration was restored.
> 3 Proposed modification:
> The follower should detect the abnormal situation where the Raft 
> configuration contains only two replicas (one leader and one follower) and 
> contact the master to correct it.
> 4 To do:
> What caused the two-replica Raft configuration is still unknown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
