[ 
https://issues.apache.org/jira/browse/FLINK-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652895#comment-16652895
 ] 

chenlf commented on FLINK-10564:
--------------------------------

We found that some tasks failed after running for several hours. Here is the error log:

java.lang.Exception: Failed to send ExecutionStateChange notification to JobManager
 at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$org$apache$flink$runtime$taskmanager$TaskManager$$handleTaskMessage$3$$anonfun$apply$2.apply(TaskManager.scala:445)
 at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$org$apache$flink$runtime$taskmanager$TaskManager$$handleTaskMessage$3$$anonfun$apply$2.apply(TaskManager.scala:429)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
 at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
 at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
 at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
 at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
 at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
 at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://[email protected]:45332/user/jobmanager#-1560855092]] after [10000 ms]
 at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
 at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
 at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
 at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
 at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
 at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
 at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
 at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
 at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
 at java.lang.Thread.run(Thread.java:745)

 

 

> tm costs too much time when communicating with  jm
> --------------------------------------------------
>
>                 Key: FLINK-10564
>                 URL: https://issues.apache.org/jira/browse/FLINK-10564
>             Project: Flink
>          Issue Type: Bug
>          Components: Core, JobManager, TaskManager
>         Environment: configs are following:
> jm
> high-availability     zookeeper
> taskmanager.heap.mb   16384
> taskmanager.memory.preallocate        false
> taskmanager.numberOfTaskSlots 64
> tm
> slots 128
> free slots 0-128
> cpu core 40 
> Physical Memory 95gb
> free Memory 32gb-50gb
> Flink Managed Memory 22gb-35gb
>            Reporter: chenlf
>            Priority: Major
>         Attachments: timeout.log
>
>
> It works fine until the number of tasks exceeds about 400.
>  There are 600+ tasks (each handling on the order of a billion records) running 
> in our cluster now, and the problem is that submitting, canceling, or querying 
> a task takes too long (and sometimes times out).
>  Resources such as memory and CPU are at normal levels.
> After debugging, we found that this method is the culprit:
>  
> org.apache.flink.runtime.util.LeaderRetrievalUtils.LeaderGatewayListener.notifyLeaderAddress(String, UUID)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
