Alexey Serbin created KUDU-3576:
-----------------------------------
Summary: An NPE thrown in Connection.exceptionCaught() makes the
connection to corresponding tablet server unusable
Key: KUDU-3576
URL: https://issues.apache.org/jira/browse/KUDU-3576
Project: Kudu
Issue Type: Bug
Components: client, java
Affects Versions: 1.17.0, 1.16.0, 1.15.0, 1.14.0, 1.13.0, 1.12.0
Reporter: Alexey Serbin
Fix For: 1.18.0, 1.17.1
If a Kudu Java client application keeps a connection to a tablet server open
and the tablet server is killed/restarted or a network error happens on the
connection, the client application might end up in a state when it cannot
communicate with the tablet server even after the tablet server is up and
running again. If the application tries to write to any tablet replica that is
hosted at the tablet server, all such requests will timeout on the very first
attempt, and the state of the connection to the server remains in a limbo since
then. The only way to get out of the trouble is to recreate the affected Java
Kudu client instance, e.g., by restarting the application.
More details are below.
Once the NPE is thrown by {{Connection.exceptionCaught()}} upon an attempt to
access null {{ctx}} variable of the {{ChannelHandlerContext}} type, all the
subsequent attempts to send Write RPC to any tablet replica hosted at the
tablet server end up with a timeout on a very first attempt (i.e. there are no
retries):
{noformat}
java.lang.RuntimeException: PendingErrors overflowed. Failed to write at least
1000 rows to Kudu; Sample errors: Timed out: cannot complete before timeout:
Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395"
[0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[],
rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
complete before timeout: Batch{operations=1000,
tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
complete before timeout: Batch{operations=1000,
tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
complete before timeout: Batch{operations=1000,
tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
complete before timeout: Batch{operations=1000,
tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
{noformat}
The root cause of the problem manifests itself as an NPE in
{{Connection.exceptionCaught()}} with a stack trace like below:
{noformat}
24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by
org.apache.kudu.client.Connection$1.operationComplete()
java.lang.NullPointerException
at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434)
at org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746)
at
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
at
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
at
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
at
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
at
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
at
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
at
org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at
org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
at
org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
at
org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
at
org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at
org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at
org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at
org.apache.kudu.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
at
org.apache.kudu.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
The issue was introduced with KUDU-1438 in changelist
[57dda5d48|https://github.com/apache/kudu/commit/57dda5d4868d29f68de4aa0ac516ca390333e6be].
--
This message was sent by Atlassian Jira
(v8.20.10#820010)