[ 
https://issues.apache.org/jira/browse/FLINK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932185#comment-16932185
 ] 

liupengcheng commented on FLINK-14038:
--------------------------------------

Thank you [~zhuzh] [~StephanEwen]. I finally verified that the timeout is
caused by GC. I have also opened a PR that adds some GC options to facilitate
debugging; I think it will be helpful.
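
For anyone hitting the same symptom, here is a minimal sketch of how GC
logging could be turned on for the TaskManagers through flink-conf.yaml. It
assumes a Java 8 HotSpot JVM, and the flags are standard JVM options, not
necessarily the ones added by the PR. The log path is a placeholder, and the
60 s ask timeout is only an illustrative workaround value while the GC pauses
are being investigated (the trace in the issue shows the ask failing after
10000 ms).

{code}
# flink-conf.yaml (sketch under the assumptions above, not taken from the PR)

# Write GC activity and stop-the-world pause times to a file on each TaskManager.
# /tmp/taskmanager-gc.log is a placeholder path.
env.java.opts.taskmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/taskmanager-gc.log

# Temporary workaround only: allow RPC asks more time than the 10 s seen in the trace.
akka.ask.timeout: 60 s
{code}

If the GC log shows pauses close to or longer than the ask timeout around the
time of the failure, that would be consistent with the TaskManager not
replying in time.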

> ExecutionGraph deploy failed due to akka timeout
> ------------------------------------------------
>
>                 Key: FLINK-14038
>                 URL: https://issues.apache.org/jira/browse/FLINK-14038
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.9.0
>         Environment: Flink on yarn
> Flink 1.9.0
>            Reporter: liupengcheng
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When launching the Flink application, the following error was reported. I 
> downloaded the operator logs but still have no clue: they provide no useful 
> information, and the operator was cancelled directly.
> JobManager logs:
> {code:java}
> java.lang.IllegalStateException: Update task on TaskManager container_e860_1567429198842_571077_01_000006 @ zjy-hadoop-prc-st320.bj (dataPort=50990) failed due to:
>       at org.apache.flink.runtime.executiongraph.Execution.lambda$sendUpdatePartitionInfoRpcCall$14(Execution.java:1395)
>       at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>       at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>       at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>       at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>       at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>       at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>       at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>       at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>       at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>       at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://fl...@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>       at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>       at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>       at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>       at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>       at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>       at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>       at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
>       at akka.dispatch.OnComplete.internal(Future.scala:263)
>       at akka.dispatch.OnComplete.internal(Future.scala:261)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
>       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>       at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
>       at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>       at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>       at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
>       at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
>       at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>       at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>       at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>       at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
>       at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
>       at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
>       at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://fl...@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>       at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>       at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>       at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
>       ... 9 more
> {code}
> operator logs:
> {code:java}
> 2019-09-09 18:34:06,867 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Received task Partition (4/5).
> 2019-09-09 18:34:06,868 INFO  org.apache.flink.runtime.taskmanager.Task                     - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched from CREATED to DEPLOYING.
> 2019-09-09 18:34:06,870 INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING]
> 2019-09-09 18:34:06,870 INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
> 2019-09-09 18:34:06,871 INFO  org.apache.flink.runtime.taskmanager.Task                     - Registering task at network: Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
> 2019-09-09 18:34:07,075 INFO  org.apache.flink.runtime.taskmanager.Task                     - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched from DEPLOYING to RUNNING.
> 2019-09-09 18:34:07,255 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Received task Sort-Partition (4/5).
> 2019-09-09 18:34:07,258 INFO  org.apache.flink.runtime.taskmanager.Task                     - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) switched from CREATED to DEPLOYING.
> 2019-09-09 18:34:07,261 INFO  org.apache.flink.runtime.taskmanager.Task                     - Creating FileSystem stream leak safety net for task Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING]
> 2019-09-09 18:34:07,261 INFO  org.apache.flink.runtime.taskmanager.Task                     - Loading JAR files for task Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
> 2019-09-09 18:34:07,263 INFO  org.apache.flink.runtime.taskmanager.Task                     - Registering task at network: Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
> 2019-09-09 18:34:07,303 INFO  org.apache.flink.runtime.taskmanager.Task                     - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) switched from DEPLOYING to RUNNING.
> 2019-09-09 18:34:54,625 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68).
> 2019-09-09 18:34:54,806 INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68) switched from RUNNING to CANCELING.
> 2019-09-09 18:34:54,806 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68).
> {code}
> I checked the network and it is fine, so maybe there is a problem with the 
> TaskManager?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)