[
https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408704#comment-17408704
]
Zhilong Hong commented on FLINK-23905:
--------------------------------------
Hi [~huntercc], thank you for proposing this idea of letting all TaskExecutors on
one physical machine share the same blob cache, rather than keeping a separate
cache for each TaskExecutor. I think there are two advantages:
# There will be less network communication and less stress on the BlobServer
in the JobMaster, since the blobs are shared across the TaskExecutors on the
same physical machine.
# The blobs will take up less disk space on the physical machines.
However, there are two questions that I'm not sure about:
# How do we manage the life cycle of the blobs? Currently, a blob is deleted
one hour after its job stops running on the TaskExecutor. Once the cache is
shared across TaskExecutors, we may need a manager running in a separate
container/pod to manage the blobs' life cycle on behalf of all TaskExecutors.
There may also be concurrency issues: for example, an exception could be raised
if a TaskExecutor tries to read a cached blob while the manager is deleting it
(see the sketch after this list). Furthermore, a manager in a separate
container/pod means that we also have to take care of its own life cycle.
# I'm not familiar with Docker and Kubernetes. To make all TaskExecutors share
the same blob cache, I think the cache directory would have to be mounted into
the TaskExecutor pods. Would we then need to change the pod template?
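To make the concurrency concern in question 1 concrete, here is a minimal sketch of how a shared, machine-wide cache could defer deletion while a blob is still referenced. The class and method names (SharedBlobCache, acquire, release, cleanup) are purely illustrative and not existing Flink APIs; the 1-hour TTL only mirrors the current per-TaskExecutor cleanup delay.
{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/**
 * Hypothetical sketch (not Flink code) of a machine-wide blob cache shared by
 * several TaskExecutors. Deletion is deferred while any TaskExecutor still
 * holds a reference, so a reader cannot race against the cleanup pass.
 */
public class SharedBlobCache {

    /** Same 1-hour retention the per-TaskExecutor cache uses today. */
    private static final long TTL_MILLIS = 60 * 60 * 1000;

    private static final class Entry {
        int refCount;
        long lastReleaseTime = System.currentTimeMillis();
    }

    private final Map<String, Entry> entriesByJobId = new HashMap<>();

    /** A TaskExecutor registers interest in a job's blobs before reading them. */
    public synchronized void acquire(String jobId) {
        entriesByJobId.computeIfAbsent(jobId, id -> new Entry()).refCount++;
    }

    /** A TaskExecutor releases a job's blobs once the job no longer runs on it. */
    public synchronized void release(String jobId) {
        Entry entry = entriesByJobId.get(jobId);
        if (entry != null) {
            entry.refCount--;
            entry.lastReleaseTime = System.currentTimeMillis();
        }
    }

    /** Periodic cleanup removes only entries that are unreferenced and expired. */
    public synchronized void cleanup() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, Entry>> it = entriesByJobId.entrySet().iterator();
        while (it.hasNext()) {
            Entry entry = it.next().getValue();
            if (entry.refCount == 0 && now - entry.lastReleaseTime > TTL_MILLIS) {
                it.remove();
                // The blob files on disk for this job would be deleted here.
            }
        }
    }
}
{code}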
> Reduce the load on JobManager when submitting large-scale job with a big user
> jar
> ---------------------------------------------------------------------------------
>
> Key: FLINK-23905
> URL: https://issues.apache.org/jira/browse/FLINK-23905
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: huntercc
> Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming
> steps in the job startup phase. Recently, we found that when submitting a
> large-scale job with a large user jar, the time spent on transitioning tasks
> from DEPLOYING to RUNNING accounts for a large proportion of the total
> startup time.
> During task initialization, the user jar has to be pulled from the JobManager
> through the BlobService. The JobManager has to spend considerable resources
> distributing these files, which puts it under heavy load during the start-up
> stage. Under this load, the JobManager may fail to respond in time to the RPC
> requests sent by the TaskManagers, causing timeout exceptions (such as Akka
> ask timeouts), which lead to job restarts and further prolong the start-up
> time of the job.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)