[ https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408518#comment-17408518 ]
huntercc edited comment on FLINK-23905 at 9/2/21, 3:57 AM:
-----------------------------------------------------------
Hi [~trohrmann]. Recently, we tried modifying
org.apache.flink.runtime.blob.AbstractBlobCache to widen the scope in which
BLOBs are shared. More specifically, we allow TaskManagers of the same job on
the same machine to share BLOB files, so that the user jar is downloaded only
once per machine. Comparing this with the current approach, in which multiple
tasks within one TM share BLOBs, we believe the two offer theoretically
equivalent resource isolation, especially since there are currently no
constraints on which tasks can be deployed in a given TM. I can find few
differences between sharing BLOBs among 10 tasks in the same TM and sharing the
files among 10 TMs that each run a single task. Nevertheless, we would
appreciate your help in assessing, from your more experienced perspective,
whether there are risks that we have not considered.
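
To make the idea concrete, here is a minimal sketch of how several
TaskManager processes on one machine could agree on a single download. It is
not the actual change to AbstractBlobCache and uses no Flink APIs: the class
SharedBlobDirectory, the Downloader callback and the directory layout are all
hypothetical names chosen for illustration. It assumes the processes can see
a common local directory and that an OS-level file lock is enough to
serialize them.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical machine-wide BLOB directory shared by the TaskManagers of one job. */
final class SharedBlobDirectory {

    /** Hypothetical stand-in for the real transfer from the BLOB server. */
    public interface Downloader {
        void downloadTo(Path target) throws IOException;
    }

    // File locks are held per JVM, so threads inside one JVM are serialized here.
    private static final Object JVM_LOCK = new Object();

    private final Path sharedDir;

    SharedBlobDirectory(Path sharedDir) throws IOException {
        this.sharedDir = Files.createDirectories(sharedDir);
    }

    /** Returns the local file for the BLOB, downloading it only if no one has yet. */
    Path getOrDownload(String jobId, String blobKey, Downloader downloader)
            throws IOException {
        Path jobDir = Files.createDirectories(sharedDir.resolve(jobId));
        Path blob = jobDir.resolve(blobKey);
        if (Files.exists(blob)) {
            return blob; // already fetched by this or another process
        }
        Path lockFile = jobDir.resolve(blobKey + ".lock");
        synchronized (JVM_LOCK) {
            try (FileChannel channel = FileChannel.open(
                         lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                 FileLock ignored = channel.lock()) { // blocks other processes
                // Re-check: another process may have finished while we waited.
                if (!Files.exists(blob)) {
                    Path tmp = Files.createTempFile(jobDir, blobKey, ".part");
                    try {
                        downloader.downloadTo(tmp);
                        // Publish atomically so readers never see a partial file.
                        Files.move(tmp, blob, StandardCopyOption.ATOMIC_MOVE);
                    } finally {
                        Files.deleteIfExists(tmp);
                    }
                }
            }
        }
        return blob;
    }
}
```

With this sketch, the first TM to call getOrDownload("job-1", "user.jar", ...)
performs the transfer and every later caller on the same machine gets the
existing file back. Cleanup of the shared directory once the last TM of the
job has exited is deliberately left out here.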
> Reduce the load on JobManager when submitting large-scale job with a big user
> jar
> ---------------------------------------------------------------------------------
>
> Key: FLINK-23905
> URL: https://issues.apache.org/jira/browse/FLINK-23905
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: huntercc
> Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming
> steps in the job startup phase. Recently, we found that when submitting a
> large-scale job with a large user jar, the time tasks spend moving from
> deploying to running accounts for a high proportion of the total startup
> time.
> In the task initialization stage, the user jar needs to be pulled from the
> JobManager through the BlobService. The JobManager has to spend a lot of
> computing power distributing the files, which puts it under heavy load during
> the start-up stage. More seriously, under this load the JobManager fails to
> respond in time to RPC requests sent by the TaskManagers, causing timeout
> exceptions such as Akka timeouts, which lead to job restarts and further
> prolong the start-up time of the job.