[ 
https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411035#comment-17411035
 ] 

Zhilong Hong commented on FLINK-23905:
--------------------------------------

I think a general solution would be better for both YARN and K8S environments.

Furthermore, I'm curious about your implementation, especially how it manages
the life cycle of blobs. Would you mind sharing it with us?

> Reduce the load on JobManager when submitting large-scale job with a big user 
> jar
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-23905
>                 URL: https://issues.apache.org/jira/browse/FLINK-23905
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming
> steps in the job startup phase. Recently, we found that when submitting a
> large-scale job with a large user jar, the time spent changing the status
> of a task from deploying to running accounts for a high proportion of the
> total startup time.
> In the task initialization stage, the user jar needs to be pulled from the
> JobManager through BlobService. The JobManager has to spend a lot of
> computing power distributing the files, which leads to a heavy load in the
> start-up stage. Under this load, the JobManager can fail to respond in time
> to RPC requests sent by the TaskManagers, causing timeout exceptions such as
> akka timeout exceptions. These lead to job restarts and further prolong the
> start-up time of the job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
