[jira] [Commented] (FLINK-13184) Support launching task executors with multi-thread on YARN.

Qi (JIRA) Thu, 11 Jul 2019 07:49:11 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883040#comment-16883040
 ]


Qi commented on FLINK-13184:
----------------------------

Below is the TM error log for this case:
------------------------

2019-07-09 13:56:59,110 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@xxx/user/resourcemanager(00000000000000000000000000000000).
2019-07-09 14:00:01,138 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@xxx/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@xxx/), Path(/user/resourcemanager)]] after [182000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-07-09 14:01:59,137 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Fatal error occurred in TaskExecutor akka.tcp://flink@xxx/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1023)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1009)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

> Support launching task executors with multi-thread on YARN.
> -----------------------------------------------------------
>
>                 Key: FLINK-13184
>                 URL: https://issues.apache.org/jira/browse/FLINK-13184
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.8.1, 1.9.0
>            Reporter: Xintong Song
>            Assignee: Xintong Song
>            Priority: Major
>
> Currently, YarnResourceManager starts all task executors in the main thread. This 
> could cause the RM thread to become unresponsive when launching a large number of 
> TEs (e.g. > 1000), leading to TE registration/heartbeat timeouts.
>  
> In Blink, we have a thread pool through which the RM starts TEs via the YARN 
> NMClient in separate threads. I think we should add this feature to the Flink 
> master branch as well.
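
The thread-pool idea in the description above could look roughly like the following. This is only a minimal sketch and not the Blink or Flink implementation: `AsyncLauncherSketch`, `ContainerLauncher`, and `launchAll` are hypothetical names, and the blocking NMClient call is simulated with a short sleep. It shows the calling thread only enqueuing launches while a fixed-size pool performs the blocking work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncLauncherSketch {

    /** Hypothetical stand-in for a blocking launch call (e.g. starting a container through the YARN NMClient). */
    interface ContainerLauncher {
        void launch(int containerIndex) throws Exception;
    }

    /**
     * Submits every launch to a fixed-size pool, so the calling (main) thread
     * only enqueues work and never blocks on an individual launch.
     * The returned future completes with the number of successful launches.
     */
    static CompletableFuture<Integer> launchAll(int count, int poolSize, ContainerLauncher launcher) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger succeeded = new AtomicInteger();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            final int index = i;
            futures.add(CompletableFuture.runAsync(() -> {
                try {
                    launcher.launch(index);
                    succeeded.incrementAndGet();
                } catch (Exception e) {
                    // Assumption: a real resource manager would release the
                    // container and request a replacement here.
                }
            }, pool));
        }
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .whenComplete((ignored, error) -> pool.shutdown())
                .thenApply(ignored -> succeeded.get());
    }

    public static void main(String[] args) throws Exception {
        // Simulate 1000 launches that each block briefly; every 100th one fails.
        int launched = launchAll(1000, 16, index -> {
            Thread.sleep(1);
            if (index % 100 == 0) {
                throw new IllegalStateException("simulated launch failure");
            }
        }).get();
        System.out.println("launched=" + launched); // prints "launched=990"
    }
}
```

In this sketch the pool size bounds how many launches run concurrently. Because the launches no longer occupy the main thread, that thread stays free to process registration and heartbeat messages.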



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
