[
https://issues.apache.org/jira/browse/FLINK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420605#comment-17420605
]
huntercc commented on FLINK-24293:
----------------------------------
I don't think using "yarn.provided.lib.dirs" for user jars is an ideal
solution for us, because it is difficult to obtain the dependency tree of
every job and to update the files in the dir whenever users change their
job's dependencies. In addition, I think all dependency jars would have to be
stored under the same dir for YARN session mode, because a resident session
cluster may start a new job with any dependency. Wouldn't this, in turn, lead
to even more unnecessary downloads?
> Tasks from the same job on a machine share user jar
> ----------------------------------------------------
>
> Key: FLINK-24293
> URL: https://issues.apache.org/jira/browse/FLINK-24293
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: huntercc
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2021-09-15-20-43-11-758.png,
> image-2021-09-15-20-43-17-304.png
>
>
> In the current blob storage design, tasks executed by the same TaskExecutor
> share a BLOB storage dir, while tasks executed by different TaskExecutors use
> different dirs. As a result, a TaskExecutor has to download the user jar even
> if the same user jar has already been downloaded by other TaskExecutors on
> the machine. We believe there is no need to download many copies of the same
> user jar to the local machine. Doing so exposes two main problems:
> # The NIC bandwidth of the distribution terminal may become a bottleneck
> !image-2021-09-15-20-43-17-304.png|width=695,height=193!
> As shown in the figure above, 24640 Mbps of the total 25000 Mbps NIC
> bandwidth was used when we launched a Flink job with 4000 TaskManagers, which
> causes long deployment times and Akka timeout exceptions.
> # More disk space is taken up
> We propose to optimize the sharing mechanism for user jars by allowing tasks
> from the same job on a machine to share a blob storage dir and, more
> specifically, the user jar in that dir. Only one task deployed to the machine
> downloads the user jar from the BLOB server or distributed file storage, and
> subsequent tasks simply use the localized user jar. In this way, the user jar
> of one job only needs to be downloaded once per machine. Here is a
> comparison of job startup time before and after the optimization.
> ||Number of TMs||Before optimization||After optimization||
> |1000|62s|37s|
> |2000|104s|40s|
> |3000|170s|43s|
> |4000|211s|45s|
>
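The download-once behaviour proposed in the description can be sketched as follows. This is only a minimal illustration, not the code from the pull request: the class `SharedJarCache`, the `Downloader` callback, and the directory layout (`<sharedDir>/<jobId>/<jarName>`) are all hypothetical names chosen for the example. It assumes that the processes on one machine can coordinate through a lock file in the shared dir.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: the first caller downloads the jar, later callers reuse it. */
public final class SharedJarCache {

    /** Hypothetical callback that fetches the jar, e.g. from the BLOB server. */
    @FunctionalInterface
    public interface Downloader {
        void downloadTo(Path target) throws IOException;
    }

    private SharedJarCache() {}

    public static Path getOrDownload(
            Path sharedDir, String jobId, String jarName, Downloader downloader)
            throws IOException {
        Path jobDir = Files.createDirectories(sharedDir.resolve(jobId));
        Path jar = jobDir.resolve(jarName);
        Path lockFile = jobDir.resolve(jarName + ".lock");

        // File locks are held per JVM, so threads inside one JVM are
        // serialized with a plain monitor before taking the file lock.
        synchronized (SharedJarCache.class) {
            try (FileChannel channel = FileChannel.open(
                         lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                 FileLock lock = channel.lock()) { // blocks until other processes release it
                if (Files.exists(jar)) {
                    return jar; // another task already localized the jar
                }
                // Download to a temporary file first so that readers never
                // observe a partially written jar.
                Path tmp = Files.createTempFile(jobDir, jarName, ".part");
                try {
                    downloader.downloadTo(tmp);
                    Files.move(tmp, jar, StandardCopyOption.ATOMIC_MOVE);
                } finally {
                    Files.deleteIfExists(tmp); // no-op after a successful move
                }
                return jar;
            }
        }
    }
}
```

In this sketch every task calls `getOrDownload` with the same shared dir and job id. Only the first call invokes the downloader, and all later calls return the already localized path. Cleanup of the shared dir once the job finishes is deliberately left out.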