[
https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403202#comment-17403202
]
Till Rohrmann commented on FLINK-23905:
---------------------------------------
You can also specify classpaths for the job execution via
{{pipeline.classpaths}}
(https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/#pipeline-classpaths)
or via {{bin/flink run --classpath URL}}. That way you can store the jars
somewhere that all nodes can reach, and they don't need to be distributed via
the {{JobManager}}. Moreover, they become part of the user code class loader,
as long as the URL is accessible by the {{URLClassLoader}}.
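
As a rough illustration of the "accessible by the {{URLClassLoader}}" condition, here is a minimal standalone sketch using only the JDK. It is not Flink's actual class loading code; the class name {{ClasspathUrlCheck}}, the method {{canResolve}}, and the file {{marker.txt}} are made up for the example. It builds a {{URLClassLoader}} over a single URL and checks whether a named resource can be found through it, which is roughly the property a classpath URL would need to satisfy on each node.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ClasspathUrlCheck {

    /**
     * Returns true if a URLClassLoader built over classpathUrl can locate resourceName.
     * Hypothetical helper for illustration only; not part of Flink.
     */
    static boolean canResolve(URL classpathUrl, String resourceName) throws IOException {
        // A null parent means the application classpath is not consulted,
        // so a hit can only come from the given URL (or the bootstrap loader).
        try (URLClassLoader loader = new URLClassLoader(new URL[] {classpathUrl}, null)) {
            return loader.getResource(resourceName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a shared location that holds the user's jars or classes.
        Path dir = Files.createTempDirectory("userlibs");
        Files.write(dir.resolve("marker.txt"), "hello".getBytes(StandardCharsets.UTF_8));

        // For an existing directory, toUri() yields a URL ending in '/',
        // which URLClassLoader treats as a directory of classes and resources.
        URL url = dir.toUri().toURL();

        System.out.println("marker.txt reachable:  " + canResolve(url, "marker.txt"));
        System.out.println("missing.txt reachable: " + canResolve(url, "missing.txt"));
    }
}
```

The same check would apply to a jar URL; if the URL cannot be reached from a given machine, lookups through the loader simply come back empty there.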
Other than that I fear there is little we can do if the user has a very large
user code jar. When using the session cluster, this file needs to be
distributed to the different {{TaskExecutors}} in order to run the user code.
Do you have a good idea how to get around this?
> Reduce the load on JobManager when submitting large-scale job with a big user
> jar
> ---------------------------------------------------------------------------------
>
> Key: FLINK-23905
> URL: https://issues.apache.org/jira/browse/FLINK-23905
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: huntercc
> Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming
> steps in the job startup phase. Recently, we found that when submitting a
> large-scale job with a large user jar, the time tasks spend moving from
> the deploying state to the running state accounts for a high proportion of
> the total startup time.
> In the task initialization stage, the user jar needs to be pulled from the
> JobManager through BlobService. The JobManager has to spend a lot of
> computing power on distributing the files, which leads to a heavy load in
> the start-up stage. More generally, the JobManager fails to respond in time
> to RPC requests sent from the TaskManager side due to the high load, causing
> timeout exceptions, such as Akka timeout exceptions, which lead to job
> restarts and further prolong the start-up time of the job.