[
https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069389#comment-17069389
]
YifanZhang commented on KUDU-3082:
----------------------------------
Unfortunately, most logs were cleaned up due to expiration before I could
analyze them. I now have partial logs for tablet
7404240f458f462d92b6588d07583a52 (full logs on ts26 and partial logs on ts25);
I'll attach them in a moment. The logs on ts27, and the leader master logs from
before the ts27 restart, are completely cleaned up :( I also kept some
fragmented logs on the master, though I'm not sure they will be helpful.
I think ts27 was in an abnormal state when the problem occurred, because some
replicas couldn't communicate with their leaders on ts27.
{code:java}
I0313 03:50:14.118202 99494 raft_consensus.cc:1149] T
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [term 2
LEADER]: Rejecting Update request from peer 47af52df1adc47e1903eb097e9c88f2e
for earlier term 1. Current term is 2. Ops: []
I0313 03:50:14.250483 56182 consensus_queue.cc:984] T
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]:
Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e"
member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status:
LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx:
55445, Time since last communication: 0.000s
I0313 03:50:14.327806 56430 consensus_queue.cc:984] T
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]:
Connected to new peer: Peer: permanent_uuid: "d1952499f94a4e6087bee28466fcb09f"
member_type: VOTER last_known_addr { host: "kudu-ts25" port: 14100 }, Status:
LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx:
54648, Time since last communication: 0.000s
I0313 03:50:14.330118 56430 consensus_queue.cc:689] T
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]:
The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been
garbage collected. The follower will never be able to catch up (Not found:
Failed to read ops 54649..55444: Segment 157 which contained index 54649 has
been GCed)
I0313 03:50:14.330137 56430 consensus_queue.cc:544] T
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]:
The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been
garbage collected. The replica will never be able to catch up
I0313 03:50:14.335949 99494 consensus_queue.cc:206] T
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]:
Queue going to LEADER mode. State: All replicated index: 0, Majority replicated
index: 55446, Committed index: 55446, Last appended: 2.55446, Last appended by
leader: 55445, Current term: 2, Majority size: 2, State: 0, Mode: LEADER,
active raft config: opid_index: 55447 OBSOLETE_local: false peers {
permanent_uuid: "7380d797d2ea49e88d71091802fb1c81" member_type: VOTER
last_known_addr { host: "kudu-ts26" port: 14100 } } peers { permanent_uuid:
"47af52df1adc47e1903eb097e9c88f2e" member_type: VOTER last_known_addr { host:
"kudu-ts27" port: 14100 } }
I0313 03:50:14.336225 56182 consensus_queue.cc:984] T
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]:
Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e"
member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status:
LMP_MISMATCH, Last received: 0.0, Next index: 55447, Last known committed idx:
55446, Time since last communication: 0.000s
W0313 03:50:14.336508 98349 consensus_peers.cc:458] T
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 -> Peer
47af52df1adc47e1903eb097e9c88f2e (kudu-ts27:14100): Couldn't send request to
peer 47af52df1adc47e1903eb097e9c88f2e. Status: Illegal state: Rejecting Update
request from peer 7380d797d2ea49e88d71091802fb1c81 for term 2. Could not
prepare a single transaction due to: Illegal state: RaftConfig change currently
pending. Only one is allowed at a time.
{code}
Judging from the above logs on ts26, it rejected the update request from peer
47af52d, and its own update requests to that peer also failed. This may mean
that the config change operation involving replica 47af52d failed but the
pending config was never cleared. This case may be similar to KUDU-1338.
> tablets in "CONSENSUS_MISMATCH" state for a long time
> -----------------------------------------------------
>
> Key: KUDU-3082
> URL: https://issues.apache.org/jira/browse/KUDU-3082
> Project: Kudu
> Issue Type: Bug
> Components: consensus
> Affects Versions: 1.10.1
> Reporter: YifanZhang
> Priority: Major
>
> Lately we found that a few tablets in one of our clusters are unhealthy; the
> ksck output is like:
>
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3
> replicas' active configs disagree with the leader master's
> 7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
> d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
> 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
> A = 7380d797d2ea49e88d71091802fb1c81
> B = d1952499f94a4e6087bee28466fcb09f
> C = 47af52df1adc47e1903eb097e9c88f2e
> D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
> Config source | Replicas | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
> master | A B C* | | | Yes
> A | A B C* | 5 | -1 | Yes
> B | A B C | 5 | -1 | Yes
> C | A B C* D~ | 5 | 54649 | No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2
> replicas' active configs disagree with the leader master's
> d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
> 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> 5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
> A = d1952499f94a4e6087bee28466fcb09f
> B = 47af52df1adc47e1903eb097e9c88f2e
> C = 5a8aeadabdd140c29a09dabcae919b31
> D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
> Config source | Replicas | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
> master | A B* C | | | Yes
> A | A B* C | 5 | 5 | Yes
> B | A B* C D~ | 5 | 96176 | No
> C | A B* C | 5 | 5 | Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1
> replicas' active configs disagree with the leader master's
> a9eaff3cf1ed483aae849549999d649a (kudu-ts23): RUNNING
> f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
> A = a9eaff3cf1ed483aae849549999d649a
> B = f75df4a6b5ce404884313af5f906b392
> C = 47af52df1adc47e1903eb097e9c88f2e
> D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
> Config source | Replicas | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
> master | A B C* | | | Yes
> A | A B C* | 1 | -1 | Yes
> B | A B C* | 1 | -1 | Yes
> C | A B C* D~ | 1 | 2 | No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3
> replicas' active configs disagree with the leader master's
> 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
> f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
> A = 47af52df1adc47e1903eb097e9c88f2e
> B = f0f7b2f4b9d344e6929105f48365f38e
> C = f75df4a6b5ce404884313af5f906b392
> D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
> Config source | Replicas | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
> master | A* B C | | | Yes
> A | A* B C D~ | 1 | 1991 | No
> B | A* B C | 1 | 4 | Yes
> C | A* B C | 1 | 4 | Yes{code}
> These tablets couldn't recover for a couple of days until we restarted
> kudu-ts27.
> I found many duplicated log messages on kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T
> 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3
> LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is
> already a config change operation in progress. Unable to promote follower
> until it completes. Doing nothing.
> I0314 04:38:41.751009 65453 raft_consensus.cc:937] T
> 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5
> LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is
> already a config change operation in progress. Unable to promote follower
> until it completes. Doing nothing.
> {code}
> There seem to be some RaftConfig change operations that somehow cannot
> complete.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)