[
https://issues.apache.org/jira/browse/FLINK-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652895#comment-16652895
]
chenlf commented on FLINK-10564:
--------------------------------
We found that some tasks failed after running for several hours. Here is the error log:
java.lang.Exception: Failed to send ExecutionStateChange notification to JobManager
    at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$org$apache$flink$runtime$taskmanager$TaskManager$$handleTaskMessage$3$$anonfun$apply$2.apply(TaskManager.scala:445)
    at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$org$apache$flink$runtime$taskmanager$TaskManager$$handleTaskMessage$3$$anonfun$apply$2.apply(TaskManager.scala:429)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://[email protected]:45332/user/jobmanager#-1560855092]] after [10000 ms]
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
    at java.lang.Thread.run(Thread.java:745)
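
The [10000 ms] in the trace appears to match Flink's default akka.ask.timeout of 10 s. As a possible stopgap only (an assumption, not a verified fix for the slow TaskManager/JobManager communication), the timeout could be raised in flink-conf.yaml. The value of 60 s below is purely illustrative:

```yaml
# flink-conf.yaml (sketch; the value is an illustrative assumption)
# Timeout for Akka ask calls, e.g. TaskManager -> JobManager notifications.
# The default is 10 s, which matches the 10000 ms in the trace above.
akka.ask.timeout: 60 s
```

This would only give slow requests more time to complete; it would not explain why the JobManager responds slowly once the task count grows.
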
> tm costs too much time when communicating with jm
> --------------------------------------------------
>
> Key: FLINK-10564
> URL: https://issues.apache.org/jira/browse/FLINK-10564
> Project: Flink
> Issue Type: Bug
> Components: Core, JobManager, TaskManager
> Environment: configs are as follows:
> jm
> high-availability zookeeper
> taskmanager.heap.mb 16384
> taskmanager.memory.preallocate false
> taskmanager.numberOfTaskSlots 64
> tm
> slots 128
> free slots 0-128
> cpu core 40
> Physical Memory 95gb
> free Memory 32gb-50gb
> Flink Managed Memory 22gb-35gb
> Reporter: chenlf
> Priority: Major
> Attachments: timeout.log
>
>
> It works fine until the number of tasks exceeds about 400.
> There are 600+ tasks (each handling data at the billion-record scale) running
> in our cluster now, and the problem is that submitting, canceling, or querying
> a task takes too long (or even times out).
> Resources such as memory and CPU are at normal levels.
> After debugging, we found that this method is the culprit:
>
> org.apache.flink.runtime.util.LeaderRetrievalUtils.LeaderGatewayListener.notifyLeaderAddress(String, UUID)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)