[
https://issues.apache.org/jira/browse/FLINK-23905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408518#comment-17408518
]
huntercc edited comment on FLINK-23905 at 9/2/21, 3:58 AM:
-----------------------------------------------------------
Hi [~trohrmann]. Recently, we tried modifying
org.apache.flink.runtime.blob.AbstractBlobCache to change the granularity at
which BLOBs are shared. More specifically, we allow TaskManagers of the same
job on a machine to share BLOB files, so that the user jar is downloaded only
once per machine. Comparing the current approach, in which multiple tasks
within a TM share BLOBs, with ours, we believe the two provide theoretically
equivalent resource isolation, especially since there are no constraints on
which tasks can be deployed in a given TM. I can find few differences between
sharing BLOBs among 10 tasks in the same TM and sharing the files among 10 TMs
that each contain a single task. Nevertheless, we would appreciate your help in
assessing, from your more experienced perspective, whether there are risks
that we have not considered.
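
To make the idea concrete, here is a minimal sketch of machine-level sharing.
It is not the actual patch and not Flink API: the class
SharedBlobCacheSketch, the Downloader interface, and the directory layout are
all hypothetical. It assumes every TaskManager on the machine can read and
write one shared directory, and it uses an OS-level file lock so that only
the first process downloads a given BLOB.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical machine-level BLOB cache; an illustration, not part of Flink. */
final class SharedBlobCacheSketch {

    /** Stand-in for the transfer from the JobManager's BLOB server (assumption). */
    interface Downloader {
        void downloadTo(Path target) throws IOException;
    }

    private final Path sharedDir;

    SharedBlobCacheSketch(Path sharedDir) {
        this.sharedDir = sharedDir;
    }

    /**
     * Returns the local path of the BLOB. The download runs only if no other
     * process on the machine has already stored the file. The method is
     * synchronized because a FileLock is held per JVM, so threads sharing
     * this instance must not try to acquire it concurrently.
     */
    synchronized Path getOrDownload(String jobId, String blobKey, Downloader downloader)
            throws IOException {
        Path jobDir = Files.createDirectories(sharedDir.resolve(jobId));
        Path blob = jobDir.resolve(blobKey);
        if (Files.exists(blob)) {
            return blob; // fast path: another TaskManager already fetched it
        }
        Path lockFile = jobDir.resolve(blobKey + ".lock");
        try (FileChannel channel = FileChannel.open(
                     lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock ignored = channel.lock()) { // blocks until other processes release it
            if (!Files.exists(blob)) { // re-check after acquiring the lock
                // Download to a temp file in the same directory, then publish it
                // atomically so readers never observe a partially written jar.
                Path tmp = Files.createTempFile(jobDir, blobKey, ".tmp");
                downloader.downloadTo(tmp);
                Files.move(tmp, blob, StandardCopyOption.ATOMIC_MOVE);
            }
        }
        return blob;
    }
}
```

With this layout, the number of transfers served by the JobManager would
scale with the number of machines rather than the number of TaskManagers. The
sketch deliberately leaves out cleanup: deciding when the shared file can be
deleted once TaskManagers of the job leave the machine is one of the points
where the two sharing models may differ.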
> Reduce the load on JobManager when submitting large-scale job with a big user
> jar
> ---------------------------------------------------------------------------------
>
> Key: FLINK-23905
> URL: https://issues.apache.org/jira/browse/FLINK-23905
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: huntercc
> Priority: Major
>
> As described in FLINK-20612 and FLINK-21731, there are some time-consuming
> steps in the job startup phase. Recently, we found that when submitting a
> large-scale job with a large user jar, the time spent moving tasks from
> the deploying state to the running state accounts for a high proportion of
> the total startup time.
> In the task initialization stage, the user jar needs to be pulled from the
> JobManager through the BlobService. The JobManager has to devote a lot of
> computing power to distributing the files, which leads to a heavy load in the
> start-up stage. More generally, the JobManager fails to respond in time to
> RPC requests sent by the TaskManagers because of the high load, causing
> timeout exceptions such as Akka timeout exceptions, which lead to job
> restarts and further prolong the start-up time of the job.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)