[
https://issues.apache.org/jira/browse/KUDU-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282530#comment-15282530
]
zhangsong commented on KUDU-1449:
---------------------------------
Below are some log fragments from the leader, a follower, and the master.
First, the leader's log:
1 The leader found that one of its followers had been unreachable for more than 300 seconds:
I0511 11:13:35.938930 4055 raft_consensus.cc:650] T
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf: Attempting
to remove follower 68d67ae4aaf44280977c6e65c7be3563 from the Raft config.
Reason: Leader has been unable to successfully communicate with Peer
68d67ae4aaf44280977c6e65c7be3563 for more than 300 seconds (300.434s)
...
W0511 11:13:36.939457 6638 consensus_peers.cc:326] T
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf -> Peer
68d67ae4aaf44280977c6e65c7be3563 (follower_ip:7052): Couldn't send request to
peer 68d67ae4aaf44280977c6e65c7be3563 for tablet
1a6607ffb4d343cab71e2a1f33a18b24. Status: Timed out: UpdateConsensus RPC to
follower_ip:7052 timed out after 1.000s. Retrying in the next heartbeat period.
Already tried 201 times.
2 The leader proposed a config change to remove the follower, and it was committed:
I0511 11:13:36.940100 4055 consensus_queue.cc:145] T
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf [LEADER]:
Queue going to LEADER mode. State: All replicated op: 0.0, Majority replicated
op: 52.1877140, Committed index: 52.1877140, Last appended: 52.1877140, Current
term: 52, Majority size: 2, State: 1, Mode: LEADER, active raft config: local:
false peers { permanent_uuid: "efc6b0312c4645b694f34d9d40f75ddf" member_type:
VOTER last_known_addr { host: "leader_ip" port: 7052 } } peers {
permanent_uuid: "670224208cc44a118fd96239f50db724" member_type: VOTER
last_known_addr { host: "another_follower" port: 7052 } }
I0511 11:13:36.941148 4055 consensus_queue.cc:572] T
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf [LEADER]:
Connected to new peer: Peer: 670224208cc44a118fd96239f50db724, Is new: false,
Last received: 52.1877140, Next index: 1877141, Last known committed idx:
1877140, Last exchange result: ERROR, Needs remote bootstrap: false
I0511 11:13:36.942076 11206 raft_consensus_state.cc:605] T
1a6607ffb4d343cab71e2a1f33a18b24 P efc6b0312c4645b694f34d9d40f75ddf [term 52
LEADER]: Committing config change with OpId 52.1877141. Old config: {
opid_index: -1 local: false peers { permanent_uuid:
"68d67ae4aaf44280977c6e65c7be3563" member_type: VOTER last_known_addr { host:
"follower_ip" port: 7052 } } peers { permanent_uuid:
"efc6b0312c4645b694f34d9d40f75ddf" member_type: VOTER last_known_addr { host:
"leader_ip" port: 7052 } } peers { permanent_uuid:
"670224208cc44a118fd96239f50db724" member_type: VOTER last_known_addr { host:
"another_follower" port: 7052 } } }. New config: { opid_index: 1877141 local:
false peers { permanent_uuid: "efc6b0312c4645b694f34d9d40f75ddf" member_type:
VOTER last_known_addr { host: "leader_ip" port: 7052 } } peers {
permanent_uuid: "670224208cc44a118fd96239f50db724" member_type: VOTER
last_known_addr { host: "another_follower" port: 7052 } } }
Log from the another_follower_ip node:
1 another_follower_ip also logged the commit of the config change above:
I0511 11:13:36.941264 27518 raft_consensus.cc:831] T
1a6607ffb4d343cab71e2a1f33a18b24 P 670224208cc44a118fd96239f50db724 [term 52
FOLLOWER]: Refusing update from remote peer efc6b0312c4645b694f34d9d40f75ddf:
Log matching property violated. Preceding OpId in replica: term: 52 index:
1877140. Preceding OpId from leader: term: 52 index: 1877141. (index mismatch)
I0511 11:13:36.943799 27530 raft_consensus_state.cc:605] T
1a6607ffb4d343cab71e2a1f33a18b24 P 670224208cc44a118fd96239f50db724 [term 52
FOLLOWER]: Committing config change with OpId 52.1877141. Old config: {
opid_index: -1 local: false peers { permanent_uuid:
"68d67ae4aaf44280977c6e65c7be3563" member_type: VOTER last_known_addr { host:
"follower_ip" port: 7052 } } peers { permanent_uuid:
"efc6b0312c4645b694f34d9d40f75ddf" member_type: VOTER last_known_addr { host:
"leader_ip" port: 7052 } } peers { permanent_uuid:
"670224208cc44a118fd96239f50db724" member_type: VOTER last_known_addr { host:
"another_follower_ip" port: 7052 } } }. New config: { opid_index: 1877141
local: false peers { permanent_uuid: "efc6b0312c4645b694f34d9d40f75ddf"
member_type: VOTER last_known_addr { host: "leader_ip" port: 7052 } } peers {
permanent_uuid: "670224208cc44a118fd96239f50db724" member_type: VOTER
last_known_addr { host: "another_follower_ip" port: 7052 } } }
2 Also notice that there are some vote-denial log entries:
I0511 11:13:10.256023 27504 raft_consensus.cc:1612] T
1a6607ffb4d343cab71e2a1f33a18b24 P 670224208cc44a118fd96239f50db724 [term 52
FOLLOWER]: Leader election vote request: Denying vote to candidate
68d67ae4aaf44280977c6e65c7be3563 for term 65 because replica is either leader
or believes a valid leader to be alive.
Log from the master:
W0511 11:13:37.052328 33208 catalog_manager.cc:1891] TS
efc6b0312c4645b694f34d9d40f75ddf: AddServer ChangeConfig RPC failed for tablet
1a6607ffb4d343cab71e2a1f33a18b24: Network error: Client connection negotiation
failed: client connection to 172.22.99.11:7052: connect: Connection timed out
(error 110)
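The eviction shown in the leader's log leaves a two-voter Raft configuration, which is what later prevents the surviving follower from winning an election. A minimal sketch of the Raft majority arithmetic (illustrative only, not Kudu code; it just applies the standard Raft rule that a candidate needs floor(n/2) + 1 votes):

```python
def majority(num_voters: int) -> int:
    # Raft: a candidate needs floor(n/2) + 1 votes to win an election.
    return num_voters // 2 + 1

# Healthy 3-replica config: a candidate needs 2 of 3 votes,
# so the loss of one node still allows an election to succeed.
assert majority(3) == 2

# After the leader evicts the unreachable follower, only 2 voters remain,
# and a candidate still needs 2 votes. If the leader then dies, the lone
# surviving follower has only its own vote (1 < 2) and can never win.
assert majority(2) == 2
```

This matches the symptom in the report: the tablet stays unavailable even though one healthy replica remains.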
> tablet unavailable caused by follower can not upgrade to leader.
> -----------------------------------------------------------------
>
> Key: KUDU-1449
> URL: https://issues.apache.org/jira/browse/KUDU-1449
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 0.8.0
> Environment: jd.com production env
> Reporter: zhangsong
> Priority: Critical
>
> 1 Background: five nodes crashed today due to system OOM. According to the
> Raft protocol, Kudu should have promoted a follower to leader and resumed
> service, but it did not.
> Found such error when issuing query via impala: "Unable to open scanner:
> Timed out: GetTableLocations(flow_first_buy_user_0504, bucket=453, string
> memberid=, int32 cate3_id=-2147483648, int32 cate2_id=-2147483648, int32
> cate1_id=-2147483648, int32 chan_type=-2147483648, int32
> county_id=-2147483648, int32 city_id=-2147483648, int32
> province_id=-2147483648, 1) failed: timed out after deadline expired: timed
> out after deadline expired
> "
> 2 Analysis:
> Based on the bucket number, we found that the target tablet had only two
> replicas, which is odd. Meanwhile, the tablet server hosting the leader
> replica had crashed.
> In that situation the follower cannot be promoted to leader: with only one
> leader and one follower, once the leader dies the follower cannot obtain a
> majority of votes (it can only vote for itself).
> This leaves the tablet unavailable even though a surviving follower still
> hosts a replica.
> After restarting kudu-server on the node hosting the previous leader
> replica, we observed that the former leader replica became a follower, the
> former follower replica became the leader, another follower replica was
> created, and the 3-replica Raft configuration was restored.
> 3 Modifications:
> A follower should detect the abnormal situation in which the Raft
> configuration contains only two replicas (one leader and one follower), and
> contact the master to correct it.
> 4 To do:
> What caused the two-replica Raft configuration is still unknown.
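The modification proposed in item 3 of the quoted description could be sketched as follows. This is a hypothetical illustration, not Kudu code: the `Config` type and the `report_to_master` callback are assumed names introduced for the example.

```python
# Hypothetical sketch of the check proposed in item 3: a follower that
# finds itself in an under-replicated Raft config reports it to the master.
# Config and report_to_master are illustrative names, not Kudu APIs.
from dataclasses import dataclass


@dataclass
class Config:
    voters: list             # permanent UUIDs of VOTER members
    replication_factor: int  # desired replica count, e.g. 3


def check_under_replication(config: Config, report_to_master) -> bool:
    """Return True (and notify the master) if the Raft config has
    fewer voters than the desired replication factor."""
    if len(config.voters) < config.replication_factor:
        report_to_master(config)
        return True
    return False
```

With such a check, the two-voter configuration seen in the logs above would be reported to the master before the leader dies, rather than being discovered only after the tablet becomes unavailable.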
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)