[ https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
huntercc reopened FLINK-23905:
------------------------------

> Reduce the load on JobManager when submitting large-scale job with a big user jar
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-23905
>                 URL: https://issues.apache.org/jira/browse/FLINK-23905
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming
> steps in the job startup phase. Recently, we found that when submitting a
> large-scale job with a large user jar, the time spent changing the status
> of a task from deploying to running accounts for a large share of the
> total startup time.
>
> In the task initialization stage, each task has to pull the user jar from
> the JobManager through BlobService. The JobManager has to spend a lot of
> computing resources distributing the files, which puts it under heavy load
> during the start-up stage. As a result, the JobManager fails to respond in
> time to RPC requests sent from the TaskManager side, causing timeout
> exceptions such as Akka timeout exceptions. These lead to job restarts and
> further prolong the start-up time of the job.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)