[ https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408704#comment-17408704 ]

Zhilong Hong commented on FLINK-23905:
--------------------------------------

Hi [~huntercc], thank you for proposing the idea of having all TaskExecutors on 
one physical machine share the same blob cache, rather than keeping a separate 
cache for each TaskExecutor. I see two advantages:
 # There will be less network traffic and less load on the BlobServer in the 
JobMaster, since the blobs are shared across the TaskExecutors on the same 
physical machine.
 # The blobs will take up less disk space on each physical machine.

However, there are two questions I'm not sure about:
 # How do we manage the life cycle of the blobs? Currently, a blob is deleted 
once its job has not been running on the TaskExecutor for one hour. Once the 
cache is shared across TaskExecutors, we may need a manager in a separate 
container/pod to manage the life cycle of the blobs for all TaskExecutors. 
There may also be concurrency issues: for example, an exception may be raised 
if a TaskExecutor tries to read a blob while the manager is deleting it (see 
the sketch after this list). Furthermore, a manager in a separate container/pod 
means that we also need to take care of its own life cycle.
 # I'm not familiar with Docker and Kubernetes, but to make all TaskExecutors 
share the same blob cache, I think the cache directory must be mounted into the 
TaskExecutor pods. Does that mean we need to change the pod template?
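
To make the concurrency concern in the first question concrete, here is a 
minimal sketch of what a host-level blob cache manager could look like. It is 
purely hypothetical (the class SharedBlobCacheManager and the methods 
put/retain/release/cleanUp are not existing Flink APIs); it only shows how 
reference counting plus the existing one-hour TTL could ensure a blob is never 
deleted while a TaskExecutor is still reading it:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a host-level blob cache shared by all TaskExecutors
 * on one physical machine. A blob becomes eligible for deletion only when no
 * TaskExecutor holds a reference to it AND the TTL (one hour, matching the
 * current per-TaskExecutor behavior) has expired since the last release.
 */
public class SharedBlobCacheManager {

    private static final class Entry {
        final Path file;
        int refCount;            // number of TaskExecutors currently using the blob
        long lastReleaseMillis;  // when the last reference was released

        Entry(Path file) {
            this.file = file;
            this.lastReleaseMillis = System.currentTimeMillis();
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final Duration ttl = Duration.ofHours(1);

    /** Registers a blob that has just been downloaded from the BlobServer. */
    public synchronized void put(String blobKey, Path file) {
        entries.putIfAbsent(blobKey, new Entry(file));
    }

    /** Called by a TaskExecutor before it reads the blob; pins the file. */
    public synchronized Path retain(String blobKey) {
        Entry e = entries.get(blobKey);
        if (e == null) {
            throw new IllegalStateException("Blob " + blobKey + " is not cached");
        }
        e.refCount++;
        return e.file;
    }

    /** Called by a TaskExecutor once its job no longer needs the blob. */
    public synchronized void release(String blobKey) {
        Entry e = entries.get(blobKey);
        if (e != null && --e.refCount == 0) {
            e.lastReleaseMillis = System.currentTimeMillis();
        }
    }

    /** Periodic cleanup task run by the shared cache manager. */
    public synchronized void cleanUp() {
        long now = System.currentTimeMillis();
        entries.entrySet().removeIf(entry -> {
            Entry e = entry.getValue();
            if (e.refCount > 0 || now - e.lastReleaseMillis < ttl.toMillis()) {
                return false;
            }
            try {
                // Safe: deletion happens under the same lock that retain() needs,
                // so no TaskExecutor can be opening this file right now.
                Files.deleteIfExists(e.file);
                return true;
            } catch (IOException ex) {
                // Keep the entry and retry on the next cleanup round.
                return false;
            }
        });
    }
}
{code}

The key point is that deletion and retain() synchronize on the same lock, so 
the race described above (a TaskExecutor reading a blob while the manager 
deletes it) cannot happen; a real implementation would of course also have to 
survive a restart of the manager itself.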

> Reduce the load on JobManager when submitting large-scale job with a big user 
> jar
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-23905
>                 URL: https://issues.apache.org/jira/browse/FLINK-23905
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming 
> steps in the job startup phase. Recently, we found that when submitting a 
> large-scale job with a large user jar, the time spent on changing the status 
> of tasks from DEPLOYING to RUNNING accounts for a large proportion of the 
> total startup time.
> In the task initialization stage, the user jar needs to be pulled from the 
> JobManager through the BlobService. The JobManager has to spend a lot of 
> resources distributing the files, which puts it under heavy load during the 
> start-up stage. Worse, under this load the JobManager may fail to respond in 
> time to the RPC requests sent by the TaskManager side, causing timeout 
> exceptions (such as Akka ask timeouts), which lead to job restarts and 
> further prolong the start-up time of the job.
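
For context on the fetch path the description refers to: in current Flink each 
TaskExecutor resolves the user-jar blob through its own blob service, so the 
same jar is downloaded from the BlobServer once per TaskExecutor rather than 
once per machine. The sketch below is only illustrative; it assumes (from 
memory) the PermanentBlobService#getFile(JobID, PermanentBlobKey) API in the 
Flink runtime and does not represent a proposed change:

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.flink.api.common.JobID;
import org.apache.flink.runtime.blob.PermanentBlobKey;
import org.apache.flink.runtime.blob.PermanentBlobService;

/**
 * Illustrative sketch only: each TaskExecutor resolves the user jar through
 * its own PermanentBlobService. On a cache miss the blob is downloaded from
 * the BlobServer on the JobManager, so a large user jar multiplied by many
 * TaskExecutors translates directly into load on the JobManager at start-up.
 */
public final class UserJarFetchSketch {

    private UserJarFetchSketch() {}

    static File fetchUserJar(
            PermanentBlobService blobService, JobID jobId, PermanentBlobKey userJarKey)
            throws IOException {
        // Checks the TaskExecutor-local cache first; on a miss the file is
        // fetched from the BlobServer. With N TaskExecutors per machine, the
        // same jar may be downloaded and stored N times on that machine.
        return blobService.getFile(jobId, userJarKey);
    }
}
{code}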



