[jira] [Commented] (FLINK-23905) Reduce the load on JobManager when submitting large-scale job with a big user jar

Till Rohrmann (Jira) Mon, 23 Aug 2021 07:08:06 -0700


    [ https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403202#comment-17403202 ]


Till Rohrmann commented on FLINK-23905:
---------------------------------------

You can also specify classpaths for the job execution via 
{{pipeline.classpaths}} 
(https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/#pipeline-classpaths)
 or via {{bin/flink run --classpath URL}}. That way you can store the jars 
somewhere accessible, and they then don't need to be distributed via the 
{{JobManager}}. Moreover, they become part of the user code class loader as 
long as the {{URLClassLoader}} can access the URL.
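
As a rough illustration of the two options above, here is a minimal Python sketch that assembles a {{bin/flink run}} invocation with one {{--classpath}} flag per URL, plus the equivalent semicolon-separated value for {{pipeline.classpaths}}. The helper names, the jar name, and the {{file:///mnt/shared/...}} paths are hypothetical placeholders; the sketch only builds the strings and does not contact a cluster.

```python
from urllib.parse import urlparse


def flink_run_command(job_jar, classpath_urls, flink_bin="bin/flink"):
    """Build a `flink run` argument list with one --classpath flag per URL.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    args = [flink_bin, "run"]
    for url in classpath_urls:
        # Each entry has to be a URL with a protocol (for example file://),
        # so reject bare paths early.
        if not urlparse(url).scheme:
            raise ValueError(f"classpath entry needs a protocol: {url}")
        args += ["--classpath", url]
    # Options go before the job jar on the command line.
    args.append(job_jar)
    return args


def pipeline_classpaths_value(classpath_urls):
    """Join the URLs into the semicolon-separated form used by pipeline.classpaths."""
    return ";".join(classpath_urls)


if __name__ == "__main__":
    # Assumed locations on storage that every node can read.
    urls = [
        "file:///mnt/shared/libs/deps-a.jar",
        "file:///mnt/shared/libs/deps-b.jar",
    ]
    print(" ".join(flink_run_command("my-job.jar", urls)))
    print("pipeline.classpaths: " + pipeline_classpaths_value(urls))
```

With this approach the large dependencies stay on the shared location and only the (ideally small) job jar goes through the usual submission path.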

Other than that, I fear there is little we can do if the user has a very large 
user code jar. When using a session cluster, this file needs to be 
distributed to the different {{TaskExecutors}} in order to run the user code. 
Do you have an idea for how to get around this?

> Reduce the load on JobManager when submitting large-scale job with a big user 
> jar
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-23905
>                 URL: https://issues.apache.org/jira/browse/FLINK-23905
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming 
> steps in the job startup phase. Recently, we found that when submitting a 
> large-scale job with a large user jar, the time spent on changing the status 
> of a task from deploying to running accounts for a high proportion of the 
> total startup time.
> In the task initialization stage, the user jar needs to be pulled from the 
> JobManager through the BlobService. The JobManager has to spend a lot of 
> computing power on distributing the files, which leads to a heavy load in the 
> start-up stage. More generally, the JobManager fails to respond in time to 
> RPC requests sent by the TaskManagers due to the high load, causing timeout 
> exceptions, such as Akka timeout exceptions, which lead to job restarts and 
> further prolong the start-up time of the job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
