Redriver created KUDU-3349:
------------------------------
Summary: Kudu java client failed to demote leader and caused a lot
of deleting rows timeout
Key: KUDU-3349
URL: https://issues.apache.org/jira/browse/KUDU-3349
Project: Kudu
Issue Type: Bug
Reporter: Redriver
During deleting rows through Spark, I found a lot of PendingErrors which caused
timeout, the deleting takes very long time, and finally failed sometimes.
{code:java}
java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least
1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout:
Batch{operations=100, tablet="b037e2e266b44c4c95da4065a8d5b719" [0x00000006,
0x00000007), ignoredErrors=[], rpc=KuduRpc(method=Write,
tablet=b037e2e266b44c4c95da4065a8d5b719, attempt=23,
TimeoutTracker(timeout=30000, elapsed=26852), Traces: [0ms] sending RPC to
server <ByteString@1d59316e size=32
contents="c51a7275257240b7a8c7e99d0895ae89">, [2ms] delaying RPC due to:
Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
config: current role FOLLOWER (error 0), [2ms] received response from server
<ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">:
Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
config: current role FOLLOWER (error 0), [10ms] sending RPC to server
<ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">,
[12ms] delaying RPC due to: Illegal state: replica
c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
FOLLOWER (error 0), [12ms] received response from server <ByteString@1d59316e
size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica
c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
FOLLOWER (error 0), [30ms] sending RPC to server <ByteString@1d59316e size=32
contents="c51a7275257240b7a8c7e99d0895ae89">, [32ms] delaying RPC due to:
Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
config: current role FOLLOWER (error 0), [32ms] received response from server
<ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">:
Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
config: current role FOLLOWER (error 0), [51ms] sending RPC to server
<ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">,
[52ms] delaying RPC due to: Illegal state: replica
c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
FOLLOWER (error 0), [52ms] received response from server <ByteString@1d59316e
size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica
c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
FOLLOWER (error 0), [70ms] sending RPC to server <ByteString@1d59316e size=32
contents="c51a7275257240b7a8c7e99d0895ae89">, [72ms] delaying RPC due to:
Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
config: current role FOLLOWER (error 0), [72ms] received response from server
<ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">:
Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
config: current role FOLLOWER (error 0), [90ms] sending RPC to server
<ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">,
[92ms] delaying RPC due to: Illegal state: replica
c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
FOLLOWER (error 0), [92ms] received response from server <ByteString@1d59316e
size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica
c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
FOLLOWER (error 0), [170ms] sending RPC to server <ByteString@1d59316e size=32
contents="c51a7275257240b7a8c7e99d0895ae89">, [172ms] delaying RPC due to:
Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
config: current role FOLLOWER (error 0), [172ms] received response from server
<ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">:
Illegal state: replica c51a7275257240b7a8c7e99d0895ae89 is not leader of this
config: current role FOLLOWER (error 0), [190ms] sending RPC to server
<ByteString@1d59316e size=32 contents="c51a7275257240b7a8c7e99d0895ae89">,
[192ms] delaying RPC due to: Illegal state: replica
c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
FOLLOWER (error 0), [192ms] received response from server <ByteString@1d59316e
size=32 contents="c51a7275257240b7a8c7e99d0895ae89">: Illegal state: replica
c51a7275257240b7a8c7e99d0895ae89 is not leader of this config: current role
FOLLOWER (erro
...
{code}
Anyway, I think I found the RCA.
After enabling DEBUG level logging, I found something interesting. The UUID is
in fact the same, but kudu-client thinks they are not. See the below.
{code:java}
22/01/19 23:34:34,621 DEBUG [kudu-nio-3] client.RemoteTablet:170 :
<ByteString@6dffd497 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437">
wasn't the leader for 44fa35c99e7042329bbfa0268c1cd4de, current leader is
<ByteString@5aadafd0 size=32 contents="fc07f681d3ea4bab9bc5ec8090ab9437">
{code}
This issue caused the kudu-client to fail to demote the leader.
{code:java}
void demoteLeader(String uuid) {
synchronized (tabletServers) {
if (leaderUuid == null) {
LOG.debug("{} couldn't be demoted as the leader for {}, there is no
known leader",
uuid, getTabletId());
return;
}
if (leaderUuid.equals(uuid)) {
leaderUuid = null;
LOG.debug("{} was demoted as the leader for {}", uuid, getTabletId());
} else {
LOG.debug("{} wasn't the leader for {}, current leader is {}", uuid,
getTabletId(), leaderUuid);
}
}
}
{code}
I take some time to debug, and finally found the tserver's uuid is generated by
the "serverMetadataPB.getUuid().toString()".
[https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L246]
The correct way to get Uuid() is "serverMetadataPB.getUuid().toStringUtf8()"
After fixing this bug, the deleting becomes faster than before because the
client will not send write to the wrong leader.
I'll submit a patch for this issue.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)