chaiyongqiang created FLINK-16626:
-------------------------------------
Summary: Exception Encountered when cancelling a job in yarn per-job mode
Key: FLINK-16626
URL: https://issues.apache.org/jira/browse/FLINK-16626
Project: Flink
Issue Type: Bug
Components: Deployment / YARN
Affects Versions: 1.10.0
Reporter: chaiyongqiang
When the following command is executed to cancel a Flink job running in YARN
per-job mode, the client keeps retrying until it times out (after 1 minute) and
exits with a failure, even though the job itself is stopped successfully.
Command:
{noformat}
flink cancel $jobId yid appId
{noformat}
The log and exception on the client side are:
{quote}
2020-03-17 12:32:13,709 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:13,749 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.ResourceLeakDetector -
-Dorg.apache.flink.shaded.netty4.io.netty.leakDetection.level: simple
2020-03-17 12:32:13,749 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.ResourceLeakDetector -
-Dorg.apache.flink.shaded.netty4.io.netty.leakDetection.targetRecords: 4
2020-03-17 12:32:13,759 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf -
-Dorg.apache.flink.shaded.netty4.io.netty.buffer.checkAccessible: true
2020-03-17 12:32:13,759 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf -
-Dorg.apache.flink.shaded.netty4.io.netty.buffer.checkBounds: true
2020-03-17 12:32:13,760 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.ResourceLeakDetectorFactory -
Loaded default ResourceLeakDetector:
org.apache.flink.shaded.netty4.io.netty.util.ResourceLeakDetector@7cb68452
2020-03-17 12:32:13,786 DEBUG
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelId -
-Dio.netty.processId: 156764 (auto-detected)
2020-03-17 12:32:13,788 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.NetUtil -
-Djava.net.preferIPv4Stack: false
2020-03-17 12:32:13,788 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.NetUtil -
-Djava.net.preferIPv6Addresses: false
2020-03-17 12:32:13,791 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.NetUtil - Loopback interface: lo
(lo, 0:0:0:0:0:0:0:1%lo)
2020-03-17 12:32:13,791 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.NetUtil -
/proc/sys/net/core/somaxconn: 128
2020-03-17 12:32:13,793 DEBUG
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelId -
-Dio.netty.machineId: 02:42:dd:ff:fe:50:e5:91 (auto-detected)
2020-03-17 12:32:13,820 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.numHeapArenas: 16
2020-03-17 12:32:13,821 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.numDirectArenas: 16
2020-03-17 12:32:13,821 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.pageSize: 8192
2020-03-17 12:32:13,821 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.maxOrder: 11
2020-03-17 12:32:13,821 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.chunkSize: 16777216
2020-03-17 12:32:13,821 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.tinyCacheSize: 512
2020-03-17 12:32:13,821 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.smallCacheSize: 256
2020-03-17 12:32:13,821 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.normalCacheSize: 64
2020-03-17 12:32:13,821 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.maxCachedBufferCapacity: 32768
2020-03-17 12:32:13,822 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.cacheTrimInterval: 8192
2020-03-17 12:32:13,822 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.cacheTrimIntervalMillis: 0
2020-03-17 12:32:13,822 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.useCacheForAllThreads: true
2020-03-17 12:32:13,822 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator -
-Dio.netty.allocator.maxCachedByteBuffersPerChunk: 1023
2020-03-17 12:32:13,830 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.ByteBufUtil -
-Dio.netty.allocator.type: pooled
2020-03-17 12:32:13,830 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.ByteBufUtil -
-Dio.netty.threadLocalDirectBufferSize: 0
2020-03-17 12:32:13,830 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.ByteBufUtil -
-Dio.netty.maxThreadLocalCharBufferSize: 16384
2020-03-17 12:32:13,891 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.Recycler -
-Dio.netty.recycler.maxCapacityPerThread: 4096
2020-03-17 12:32:13,891 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.Recycler -
-Dio.netty.recycler.maxSharedCapacityFactor: 2
2020-03-17 12:32:13,891 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.Recycler -
-Dio.netty.recycler.linkCapacity: 16
2020-03-17 12:32:13,892 DEBUG
org.apache.flink.shaded.netty4.io.netty.util.Recycler -
-Dio.netty.recycler.ratio: 8
2020-03-17 12:32:16,977 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:19,984 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:22,989 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:25,994 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:28,999 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:32,003 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:35,008 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:38,012 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:41,017 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:44,021 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:47,027 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:50,031 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:53,036 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:56,040 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:32:59,044 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:33:02,050 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:33:05,054 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:33:08,059 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:33:11,065 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient -
Shutting down rest endpoint.
2020-03-17 12:33:14,070 DEBUG org.apache.flink.runtime.rest.RestClient -
Sending request of class class
org.apache.flink.runtime.rest.messages.EmptyRequestBody to
ocdt31.aicloud.local:51231/v1/jobs/cc61033484d4c0e7a27a8a2a36f4de7a?mode=cancel
2020-03-17 12:33:14,077 DEBUG
org.apache.flink.shaded.netty4.io.netty.buffer.PoolThreadCache - Freed 2
thread-local buffer(s) from thread: flink-rest-client-netty-thread-1
2020-03-17 12:33:14,080 DEBUG org.apache.flink.runtime.rest.RestClient - Rest
endpoint shutdown complete.
2020-03-17 12:33:14,083 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
org.apache.flink.util.FlinkException: Could not cancel job cc61033484d4c0e7a27a8a2a36f4de7a.
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
    at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
    ... 9 more
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.util.FlinkException: Could not cancel job cc61033484d4c0e7a27a8a2a36f4de7a.
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
    at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:904)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
    ... 9 more
{quote}
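The client log above shows the cancel request being resent roughly every 3 seconds from 12:32:13 until 12:33:14, when the wait on a CompletableFuture ends in a TimeoutException. A minimal sketch of that pattern, using only JDK classes, is given here for illustration. It is not Flink's actual client code: the class name CancelTimeoutSketch, the method retryUntilTimeout, and the short intervals are all assumptions made for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not taken from the Flink code base.
final class CancelTimeoutSketch {

    /**
     * Resends a request at a fixed interval while waiting up to timeoutMillis
     * for an acknowledgement that never arrives. Returns the number of
     * attempts that were made before the wait timed out.
     */
    static int retryUntilTimeout(long retryIntervalMillis, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        // Stand-in for the cancel acknowledgement; nothing ever completes it.
        CompletableFuture<Void> ack = new CompletableFuture<>();
        // Each tick stands in for one "Sending request ... ?mode=cancel" log line.
        scheduler.scheduleAtFixedRate(
                attempts::incrementAndGet, 0, retryIntervalMillis, TimeUnit.MILLISECONDS);
        try {
            // Same call shape as in the stack trace: CompletableFuture.get with a timeout.
            ack.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The client reports a failure here, regardless of whether the
            // server-side action actually succeeded.
        } finally {
            scheduler.shutdownNow();
        }
        return attempts.get();
    }

    public static void main(String[] args) throws Exception {
        // Scaled-down stand-ins for the 3 s retry interval and 1 minute timeout.
        int attempts = retryUntilTimeout(30, 600);
        System.out.println("attempts before timeout: " + attempts);
    }
}
```

The point of the sketch is that the timeout only tells the client that no acknowledgement arrived; it says nothing about whether the cancellation took effect, which matches the behaviour described in this report.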
The job was in fact cancelled, but the server side also logs an exception:
{quote}
2020-03-17 12:25:13,754 ERROR [flink-akka.actor.default-dispatcher-17] org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:766) - Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:855)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:340)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:333)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:766)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:764)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:421)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:149)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:95)
    at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:30)
    at org.apache.flink.runtime.rest.handler.util.HandlerUtils.sendResponse(HandlerUtils.java:224)
    at org.apache.flink.runtime.rest.handler.util.HandlerUtils.sendResponse(HandlerUtils.java:176)
    at org.apache.flink.runtime.rest.handler.util.HandlerUtils.sendResponse(HandlerUtils.java:91)
    at org.apache.flink.runtime.rest.handler.AbstractRestHandler.lambda$respondToRequest$0(AbstractRestHandler.java:78)
    at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:656)
    at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:632)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
    at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:874)
    at akka.dispatch.OnComplete.internal(Future.scala:264)
    at akka.dispatch.OnComplete.internal(Future.scala:261)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
    at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
    at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
    at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
    at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
    at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{quote}
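The server-side trace fails inside HandlerUtils.sendResponse with "event executor terminated", that is, a task was handed to an executor that had already been shut down. One possible reading, which this report does not confirm, is that the REST endpoint's event loop was stopped before the cancel response could be written, so the client never received an acknowledgement. The sketch below reproduces only the general mechanism with a plain JDK executor instead of Netty's event loop; the class and method names are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical illustration only; uses a JDK executor as a stand-in for
// the shaded Netty event loop seen in the stack trace.
final class RejectedAfterShutdownSketch {

    /** Returns true if work submitted after shutdown is rejected. */
    static boolean submitAfterShutdownIsRejected() {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        // The executor is shut down first, e.g. as part of cluster teardown.
        eventLoop.shutdown();
        try {
            // A late task (here: sending the response) arrives after shutdown.
            eventLoop.execute(() -> System.out.println("send response"));
            return false;
        } catch (RejectedExecutionException e) {
            // The response is never sent, so the caller sees no reply.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected: " + submitAfterShutdownIsRejected());
    }
}
```

If this reading is right, the ordering of "answer the cancel request" and "shut down the REST endpoint" on the server would be the place to look, but that remains an assumption until verified against the Flink code.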