[jira] [Comment Edited] (FLINK-13184) Starting a TaskExecutor blocks the YarnResourceManager's main thread

Yang Wang (Jira) Sun, 10 Nov 2019 04:25:22 -0800


    [ https://issues.apache.org/jira/browse/FLINK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971079#comment-16971079 ]


Yang Wang edited comment on FLINK-13184 at 11/10/19 12:24 PM:
--------------------------------------------------------------

[~trohrmann]

`NMClientAsync` uses an internal thread pool to execute all container launch 
events. When we call `startContainerAsync`, it just puts an event into the 
blocking queue. So replacing `NMClient` with `NMClientAsync` should not have 
many unknown implications. I will also run more tests on our YARN cluster.
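
To make the queueing behaviour concrete, here is a minimal sketch using only 
the JDK. `AsyncLauncherSketch` is a hypothetical stand-in, not Hadoop's 
`NMClientAsync`; the pool size, the sleep that simulates the NodeManager call, 
and the `awaitAllStarted` helper are assumptions made for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in, NOT Hadoop's NMClientAsync: it only mimics the
// "enqueue the launch event, let pool threads do the blocking work" pattern.
class AsyncLauncherSketch {
    // A fixed thread pool keeps submitted tasks in an internal blocking queue.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Set<String> started = ConcurrentHashMap.newKeySet();

    // Returns as soon as the launch event is queued; the caller never waits
    // for the (simulated) remote call to the NodeManager.
    void startContainerAsync(String containerId) {
        pool.execute(() -> {
            simulateBlockingLaunch();
            started.add(containerId);
        });
    }

    // Stops accepting new events, waits for queued launches, returns the ids.
    Set<String> awaitAllStarted() {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("launches did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return started;
    }

    private static void simulateBlockingLaunch() {
        try {
            Thread.sleep(5); // stands in for a slow RPC
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        AsyncLauncherSketch launcher = new AsyncLauncherSketch();
        long begin = System.nanoTime();
        for (int i = 0; i < 200; i++) {
            launcher.startContainerAsync("container_" + i);
        }
        long enqueueMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
        System.out.println("enqueued 200 launches in " + enqueueMillis + " ms");
        System.out.println("started: " + launcher.awaitAllStarted().size());
    }
}
```

The calling thread only pays for the enqueue; the blocking part runs on the 
pool threads, which is the property that matters for the ResourceManager's 
main thread.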

I think using dynamic properties instead of uploading an HDFS file is better. 
We have some big Flink applications with more than 5000 containers. Even if we 
use a thread pool to upload the config file to HDFS, it will still be very 
slow. Since the config for the task manager and the config uploaded by the 
Flink client differ only slightly, using dynamic properties is reasonable.
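
As a rough sketch of the idea, the difference between the two configs could 
be turned into `-Dkey=value` arguments for the task manager start command. 
The class and method names below are hypothetical, the config values are made 
up, and escaping of special characters in values is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical helper, not Flink code: it renders config differences as
// "-Dkey=value" arguments instead of writing a second config file.
class DynamicPropertiesSketch {
    // One argument per entry of target that is missing from, or different
    // in, base. Keys are sorted so the output order is deterministic.
    // Keys that exist only in base are not expressed.
    static List<String> diffAsDynamicProperties(Map<String, String> base,
                                                Map<String, String> target) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(target).entrySet()) {
            if (!Objects.equals(base.get(entry.getKey()), entry.getValue())) {
                args.add("-D" + entry.getKey() + "=" + entry.getValue());
            }
        }
        return args;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, String> clientConfig = new HashMap<>();
        clientConfig.put("jobmanager.rpc.address", "localhost");
        clientConfig.put("taskmanager.numberOfTaskSlots", "4");

        Map<String, String> taskManagerConfig = new HashMap<>(clientConfig);
        taskManagerConfig.put("jobmanager.rpc.address", "am-host.example.com");

        System.out.println(diffAsDynamicProperties(clientConfig, taskManagerConfig));
    }
}
```

Because only the differing keys are passed, the command line stays short even 
when the full config is large.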



> Starting a TaskExecutor blocks the YarnResourceManager's main thread
> --------------------------------------------------------------------
>
>                 Key: FLINK-13184
>                 URL: https://issues.apache.org/jira/browse/FLINK-13184
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.8.1, 1.9.0, 1.10.0
>            Reporter: Xintong Song
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.10.0, 1.8.3, 1.9.2
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, YarnResourceManager starts all task executors in the main thread.
> This could cause the RM to become unresponsive when launching a large number
> of TEs (e.g. > 1000) because it involves blocking I/O operations (writing
> files to HDFS, communicating with the node manager using a synchronous
> {{NMClient}}). As a consequence, TE registration/heartbeat timeouts can occur
> and Flink might allocate too many containers (see FLINK-12342) because it
> cannot process the {{YarnResourceManager#onContainersAllocated}} calls.
> There are different solution approaches, but the end goal should be to not
> execute any blocking calls in the {{ResourceManager}}'s main thread:
> 1. Start the TaskExecutors from a different thread (potentially thread pool) 
> which is responsible for uploading the files and communicating with the 
> NodeManager
> 2. Don't upload files (avoid blocking file system operations) and use the 
> {{NMClientAsync}} for the communication with Yarn's {{NodeManager}}.
> 3. Upload files in a separate I/O thread and use the {{NMClientAsync}} for 
> the communication with Yarn's {{NodeManager}}.
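
For reference, the third approach in the quoted description (upload on an I/O 
thread, then hand the start request back) might look roughly like the sketch 
below. It uses only the JDK; `UploadThenLaunchSketch`, the simulated upload, 
and the returned strings are hypothetical and do not reflect the actual 
`YarnResourceManager` code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: blocking upload on an I/O executor, follow-up step
// on a separate "main thread" executor that never blocks on the upload.
class UploadThenLaunchSketch {
    // Simulated blocking file upload; returns a placeholder location.
    static String uploadConfig(String containerId) {
        try {
            Thread.sleep(10); // stands in for a slow file system write
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "uploaded-config-for-" + containerId;
    }

    // The upload runs on ioExecutor; once it completes, the cheap follow-up
    // (building the start request) runs on mainThreadExecutor.
    static CompletableFuture<String> startTaskExecutor(String containerId,
                                                       Executor ioExecutor,
                                                       Executor mainThreadExecutor) {
        return CompletableFuture
                .supplyAsync(() -> uploadConfig(containerId), ioExecutor)
                .thenApplyAsync(
                        location -> "start " + containerId + " using " + location,
                        mainThreadExecutor);
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newFixedThreadPool(4);
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            System.out.println(startTaskExecutor("container_1", io, mainThread).join());
        } finally {
            io.shutdown();
            mainThread.shutdown();
        }
    }
}
```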



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
