[jira] [Commented] (FLINK-23905) Reduce the load on JobManager when submitting large-scale job with a big user jar

huntercc (Jira) Mon, 06 Sep 2021 20:42:07 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410911#comment-17410911
 ]


huntercc commented on FLINK-23905:
----------------------------------

Thanks for raising those practical concerns, Zhilong. `Flink on yarn` is 
likely to remain our team's long-term deployment mode, so we want to optimize 
large job submission on YARN first. We have no complete plan for 
`Flink on k8s` at the moment. However, I think the modification mentioned 
above should not make things worse even if we don't mount the public 
_Blob dir_ for each TaskExecutor pod; that would only mean reduced 
shareability.
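
To make the fallback argument concrete, here is a minimal, language-agnostic 
sketch in Python. It is not Flink code: `resolve_blob`, `private_dir`, 
`shared_dir` and `fetch_from_jobmanager` are hypothetical names used only for 
illustration. It assumes that a blob is cached in the shared directory when one 
is mounted, and in a per-TaskExecutor private directory otherwise.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def resolve_blob(
    blob_key: str,
    private_dir: Path,
    fetch_from_jobmanager: Callable[[str], bytes],
    shared_dir: Optional[Path] = None,
) -> Path:
    """Return a local path for blob_key, downloading it only on a cache miss.

    All names here are hypothetical; this only illustrates the fallback idea.
    """
    # Use the shared directory when it is mounted, else the private one.
    use_shared = shared_dir is not None and shared_dir.is_dir()
    cache_dir = shared_dir if use_shared else private_dir
    cache_dir.mkdir(parents=True, exist_ok=True)

    target = cache_dir / blob_key
    if target.exists():
        # Cache hit: no request is sent to the JobManager.
        return target

    # Cache miss: this is the only step that puts load on the JobManager.
    data = fetch_from_jobmanager(blob_key)

    # Write to a temporary file, then rename, so that other readers of the
    # directory never observe a partially written blob.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    os.replace(tmp_name, target)
    return target
```

With a shared directory, TaskExecutors that see the same mount trigger a single 
download per blob. Without it, each TaskExecutor downloads once into its own 
directory, which matches the current per-TaskExecutor behaviour rather than 
something worse.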

> Reduce the load on JobManager when submitting large-scale job with a big user 
> jar
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-23905
>                 URL: https://issues.apache.org/jira/browse/FLINK-23905
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming 
> steps in the job startup phase. Recently, we found that when submitting a 
> large-scale job with a large user jar, the time spent on changing the status 
> of a task from deploying to running accounts for a high proportion of the 
> total startup time.
> In the task initialization stage, the user jar needs to be pulled from the 
> JobManager through BlobService. The JobManager has to devote a lot of 
> computing power to distributing the files, which leads to a heavy load in 
> the start-up stage. More generally, the JobManager fails to respond in time 
> to RPC requests sent from the TaskManager side due to the high load, causing 
> timeout exceptions, such as Akka timeout exceptions, which lead to job 
> restarts and further prolong the start-up time of the job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
