[
https://issues.apache.org/jira/browse/KUDU-3576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Serbin resolved KUDU-3576.
---------------------------------
Resolution: Fixed
> An NPE thrown in Connection.exceptionCaught() makes the connection to
> corresponding tablet server unusable
> ----------------------------------------------------------------------------------------------------------
>
> Key: KUDU-3576
> URL: https://issues.apache.org/jira/browse/KUDU-3576
> Project: Kudu
> Issue Type: Bug
> Components: client, java
> Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0
> Reporter: Alexey Serbin
> Priority: Major
> Fix For: 1.18.0, 1.17.1
>
>
> If a Kudu Java client application keeps a connection to a tablet server open
> and the tablet server is killed/restarted or a network error happens on the
> connection, the client application might end up in a state where it cannot
> communicate with the tablet server even after the tablet server is up and
> running again. If the application tries to write to any tablet replica
> hosted on that tablet server, every such request times out on the very
> first attempt, and the connection to the server stays in limbo from then
> on. The only way to recover is to recreate the affected Kudu Java client
> instance, e.g., by restarting the application.
> More details are below.
> Once the NPE is thrown by {{Connection.exceptionCaught()}} upon an attempt to
> access the null {{ctx}} variable of the {{ChannelHandlerContext}} type, all
> subsequent attempts to send a Write RPC to any tablet replica hosted on the
> tablet server end up with a timeout on the very first attempt (i.e., there
> are no retries):
> {noformat}
> java.lang.RuntimeException: PendingErrors overflowed. Failed to write at
> least 1000 rows to Kudu; Sample errors: Timed out: cannot complete before
> timeout: Batch{operations=1000, tablet="f4c271e3b0d74d5bb6b45ea06987f395"
> [0x0000000B8134D82B, 0x0000000B8134D82C), ignoredErrors=[],
> rpc=KuduRpc(method=Write, tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
> complete before timeout: Batch{operations=1000,
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
> complete before timeout: Batch{operations=1000,
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
> complete before timeout: Batch{operations=1000,
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}Timed out: cannot
> complete before timeout: Batch{operations=1000,
> tablet="f4c271e3b0d74d5bb6b45ea06987f395" [0x0000000B8134D82B,
> 0x0000000B8134D82C), ignoredErrors=[], rpc=KuduRpc(method=Write,
> tablet=f4c271e3b0d74d5bb6b45ea06987f395, attempt=1,
> TimeoutTracker(timeout=30000, elapsed=30018), Trace Summary(0 ms): Sent(1),
> Received(0), Delayed(0), MasterRefresh(0), AuthRefresh(0), Truncated: false
> Sent: (62388262f255417f8cdcbefee23f8027, [ Write, 1 ]))}
> {noformat}
> The root cause of the problem manifests itself as an NPE in
> {{Connection.exceptionCaught()}} with a stack trace like the one below:
> {noformat}
> 24/04/27 13:07:18 WARN DefaultPromise: An exception was thrown by org.apache.kudu.client.Connection$1.operationComplete()
> java.lang.NullPointerException
> at org.apache.kudu.client.Connection.exceptionCaught(Connection.java:434)
> at org.apache.kudu.client.Connection$1.operationComplete(Connection.java:746)
> at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
> at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
> at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
> at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
> at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
> at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
> at org.apache.kudu.shaded.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
> at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
> at org.apache.kudu.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
> at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
> at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
> at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
> at org.apache.kudu.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
> at org.apache.kudu.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
> at org.apache.kudu.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The issue was introduced with KUDU-1438 in changelist
> [57dda5d48|https://github.com/apache/kudu/commit/57dda5d4868d29f68de4aa0ac516ca390333e6be].
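
The failure mode described above can be illustrated with a small, self-contained sketch. It is not the actual Kudu code and not the fix that was committed: `NullCtxSketch`, `SketchConnection`, `Ctx`, `State`, and all method names are hypothetical stand-ins, and the assumption is simply that the context field is only assigned once the channel becomes active, so a failed connect reaches the error handler while the field is still null. The `notifyConnectFailed()` helper mimics a listener notifier that logs and swallows exceptions thrown by a callback, which is what the WARN line from `DefaultPromise` in the stack trace suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; NOT org.apache.kudu.client.Connection.
class NullCtxSketch {

  // Minimal stand-in for a channel handler context (assumption).
  static final class Ctx {
    boolean closed = false;
    void close() { closed = true; }
  }

  enum State { CONNECTING, READY, TERMINATED }

  static final class SketchConnection {
    // Assigned only when the channel becomes active; stays null if the
    // connect attempt fails first.
    Ctx ctx;
    State state = State.CONNECTING;
    final List<String> queuedRpcs = new ArrayList<>();
    final List<String> failedRpcs = new ArrayList<>();

    void enqueue(String rpc) { queuedRpcs.add(rpc); }

    void channelActive(Ctx c) { ctx = c; state = State.READY; }

    // Unguarded handler: dereferences ctx before doing any cleanup, so a
    // null ctx aborts the method and the cleanup never runs.
    void exceptionCaughtUnguarded(Throwable cause) {
      ctx.close(); // NullPointerException when ctx == null
      cleanup();
    }

    // Guarded handler: the cleanup runs whether or not ctx was ever set.
    void exceptionCaughtGuarded(Throwable cause) {
      if (ctx != null) {
        ctx.close();
      }
      cleanup();
    }

    // Marks the connection terminated and fails every queued RPC so that
    // callers can retry instead of waiting for a timeout.
    private void cleanup() {
      state = State.TERMINATED;
      failedRpcs.addAll(queuedRpcs);
      queuedRpcs.clear();
    }
  }

  // Mimics a notifier that logs and swallows exceptions thrown by a
  // listener. Returns true if the listener completed normally.
  static boolean notifyConnectFailed(Runnable listener) {
    try {
      listener.run();
      return true;
    } catch (RuntimeException e) {
      System.out.println("WARN: an exception was thrown by listener: " + e);
      return false;
    }
  }

  public static void main(String[] args) {
    SketchConnection stuck = new SketchConnection();
    stuck.enqueue("Write#1");
    notifyConnectFailed(
        () -> stuck.exceptionCaughtUnguarded(new RuntimeException("connect failed")));
    System.out.println("unguarded: state=" + stuck.state
        + " queued=" + stuck.queuedRpcs + " failed=" + stuck.failedRpcs);

    SketchConnection recovered = new SketchConnection();
    recovered.enqueue("Write#1");
    notifyConnectFailed(
        () -> recovered.exceptionCaughtGuarded(new RuntimeException("connect failed")));
    System.out.println("guarded: state=" + recovered.state
        + " queued=" + recovered.queuedRpcs + " failed=" + recovered.failedRpcs);
  }
}
```

In the unguarded case the sketch connection never leaves its initial state and the queued RPC is never failed back to the caller, which loosely mirrors the "times out on the very first attempt, no retries" symptom in the report. In the guarded case the connection is marked terminated and the queued RPC is handed back, so a caller could open a fresh connection and retry.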