[
https://issues.apache.org/jira/browse/KUDU-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622657#comment-17622657
]
ASF subversion and git services commented on KUDU-3169:
-------------------------------------------------------
Commit 20be4ede1e0dc2f74d583373a3a0e3062529c6fe in kudu's branch
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=20be4ede1 ]
[tserver] validate scanner TTL vs RPC connection timeout
This patch adds a group validator for the --scanner_ttl_ms and
--rpc_default_keepalive_time_ms flags. The validator outputs a warning
(not an error) if the TTL for an idle scanner is greater than the
timeout for an idle RPC connection.
Even if an idle scanner is kept alive on the server side for some time,
Kudu servers periodically close connections that have been idle for at
least the --rpc_default_keepalive_time_ms interval. Setting
--rpc_default_keepalive_time_ms to a value greater than or equal to
--scanner_ttl_ms therefore helps keep not-yet-used connections to idle
scanners open, avoiding inadvertent closure and re-opening of
connections to scanners that might still be sent continuation scan
requests. The new constraint also helps to work around one particular
bug [1] in the Kudu Java client.
I didn't add a test for the newly added validator, but I manually
verified that the warning is output as expected when necessary.
[1] https://issues.apache.org/jira/browse/KUDU-3169
Change-Id: If1439dfb6eb82ba2be0472547b04e5a692879535
Reviewed-on: http://gerrit.cloudera.org:8080/19152
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Yuqi Du <[email protected]>
Reviewed-by: Yingchun Lai <[email protected]>
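For illustration only, the relationship the new validator checks can be sketched as a
small standalone snippet. The real validator lives in the tserver's C++ flag-handling
code, so the class, method, and parameter names below are hypothetical and exist only
to make the constraint concrete:
{code:java}
// Hypothetical sketch (not Kudu source): warn when an idle scanner can outlive
// the idle RPC connection that would carry its continuation requests.
public final class ScannerTtlCheck {
  public static void warnIfMisconfigured(long scannerTtlMs, long rpcKeepaliveTimeMs) {
    // The validator's condition: --scanner_ttl_ms should not exceed --rpc_default_keepalive_time_ms.
    if (scannerTtlMs > rpcKeepaliveTimeMs) {
      System.err.printf(
          "warning: --scanner_ttl_ms (%d) is greater than --rpc_default_keepalive_time_ms (%d); "
              + "idle connections to still-live scanners may be closed and re-opened%n",
          scannerTtlMs, rpcKeepaliveTimeMs);
    }
  }

  public static void main(String[] args) {
    warnIfMisconfigured(60_000, 65_000);  // OK: the keepalive interval covers the scanner TTL.
    warnIfMisconfigured(120_000, 65_000); // Warns: a scanner can outlive its idle connection.
  }
}
{code}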
> kudu java client throws scanner expired error while processing large scan on
> High-load cluster
> -----------------------------------------------------------------------------------------------
>
> Key: KUDU-3169
> URL: https://issues.apache.org/jira/browse/KUDU-3169
> Project: Kudu
> Issue Type: Bug
> Components: client, java
> Affects Versions: 1.8.0, 1.9.0, 1.10.0, 1.10.1, 1.11.0, 1.12.0, 1.11.1
> Reporter: mintao
> Priority: Major
> Labels: scalability, stability
>
> A user submitted a Spark task to scan a Kudu table with a large number of
> records. After just a few minutes the job failed after 4 attempts, each attempt
> failing with this error:
> {code:java}
> org.apache.kudu.client.NonRecoverableException: Scanner 4e34e6f821be42b889022ec681e235cc not found (it may have expired)
> org.apache.kudu.client.NonRecoverableException: Scanner 4e34e6f821be42b889022ec681e235cc not found (it may have expired)
>     at org.apache.kudu.client.KuduException.transformException(KuduException.java:110)
>     at org.apache.kudu.client.KuduClient.joinAndHandleException(KuduClient.java:402)
>     at org.apache.kudu.client.KuduScanner.nextRows(KuduScanner.java:57)
>     at org.apache.kudu.spark.kudu.RowIterator.hasNext(KuduRDD.scala:153)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>     at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>     Suppressed: org.apache.kudu.client.KuduException$OriginalException: Original asynchronous stack trace
>         at org.apache.kudu.client.RpcProxy.dispatchTSError(RpcProxy.java:341)
>         at org.apache.kudu.client.RpcProxy.responseReceived(RpcProxy.java:263)
>         at org.apache.kudu.client.RpcProxy.access$000(RpcProxy.java:59)
>         at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:152)
>         at org.apache.kudu.client.RpcProxy$1.call(RpcProxy.java:148)
>         at org.apache.kudu.client.Connection.messageReceived(Connection.java:391)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.kudu.client.Connection.handleUpstream(Connection.java:243)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.timeout.ReadTimeoutHandler.messageReceived(ReadTimeoutHandler.java:184)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.oneone.OneToOneDecoder.handleUpstream(OneToOneDecoder.java:70)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
>         at org.apache.kudu.shaded.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>         at org.apache.kudu.shaded.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>         at org.apache.kudu.shaded.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>         at org.apache.kudu.shaded.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>         ... 3 more
> {code}
> Each task ran for only about 19 seconds before throwing the scanner-not-found
> error, even though the tserver uses the default scanner_ttl_ms (60s). In the
> tserver log, we found that the scanner mentioned in the client log expired only
> after the Spark job had already failed, and that another tserver received a scan
> request specifying that scanner ID.
> It seems the AsyncKuduScanner in the Kudu Java client chooses a random server
> when retrying scanNextRows, even though the AsyncKuduScanner already has a
> scannerId.
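As a rough client-side mitigation sketch (not part of the original report, and assuming a
Java client version that provides KuduScanner.keepAlive()), a long-running scan can ping
the tserver between batches so the scanner is not expired while rows are processed slowly.
The master address and table name below are placeholders:
{code:java}
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;

public final class KeepAliveScanExample {
  public static void main(String[] args) throws KuduException {
    try (KuduClient client = new KuduClient.KuduClientBuilder("master-1:7051").build()) {
      KuduTable table = client.openTable("my_table"); // placeholder table name
      KuduScanner scanner = client.newScannerBuilder(table).build();
      try {
        while (scanner.hasMoreRows()) {
          RowResultIterator batch = scanner.nextRows();
          for (RowResult row : batch) {
            process(row); // potentially slow, application-specific work
          }
          // Tell the tserver the scanner is still in use so its TTL does not lapse
          // before the next continuation request is sent.
          scanner.keepAlive();
        }
      } finally {
        scanner.close();
      }
    }
  }

  private static void process(RowResult row) {
    // Placeholder for application-specific row handling.
  }
}
{code}
This does not address the routing issue described above (the retry going to a different
tserver), but it reduces the chance of the scanner expiring during slow, long-running scans.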
--
This message was sent by Atlassian Jira
(v8.20.10#820010)