[
https://issues.apache.org/jira/browse/FLINK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
huntercc reopened FLINK-24293:
------------------------------
> Tasks from the same job on a machine share user jar
> ----------------------------------------------------
>
> Key: FLINK-24293
> URL: https://issues.apache.org/jira/browse/FLINK-24293
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: huntercc
> Priority: Major
> Attachments: image-2021-09-15-19-39-00-101.png,
> image-2021-09-15-19-39-40-461.png
>
>
> In the current BLOB storage design, tasks executed by the same TaskExecutor
> share a BLOB storage directory, while tasks executed by different
> TaskExecutors use different directories. As a result, a TaskExecutor has to
> download the user jar even if another TaskExecutor on the same machine has
> already downloaded it. We believe there is no need to download multiple
> copies of the same user jar to one machine. Doing so exposes two main
> problems:
> # The NIC bandwidth on the distributing side may become a bottleneck.
> !image-2021-09-15-19-39-40-461.png|width=651,height=216! As shown in the
> figure above, 24640 Mbps of the total 25000 Mbps NIC bandwidth was used when
> we launched a Flink job with 4000 TaskManagers, which leads to long
> deployment times and Akka timeout exceptions.
> # The duplicate copies take up more disk space.
> We propose optimizing the user jar sharing mechanism by allowing tasks
> from the same job on a machine to share a BLOB storage directory and, more
> specifically, the user jar in that directory. Only one task deployed to the
> machine downloads the user jar from the BLOB server or distributed file
> storage; subsequent tasks simply use the localized user jar. In this way,
> the user jar of a job only needs to be downloaded once per machine. Here is
> a comparison of job startup time before and after the optimization.
> ||Number of TMs||Before optimization||After optimization||
> |1000|62s|37s|
> |2000|104s|40s|
> |3000|170s|43s|
> |4000|211s|45s|
>
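For illustration only, the "download once per machine" idea in the description could be sketched roughly as below. This is not the proposed Flink patch: `SharedJarCache`, `JarDownloader`, and the directory layout are hypothetical names made up for this sketch, and the download itself is left as a caller-supplied callback. It assumes TaskExecutors on one machine can all reach a common local directory, and it uses an OS-level file lock to coordinate between separate processes.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: one shared jar copy per job per machine. */
public class SharedJarCache {

    /** Hypothetical stand-in for fetching from the BLOB server or distributed file storage. */
    @FunctionalInterface
    public interface JarDownloader {
        void downloadTo(Path target) throws IOException;
    }

    /**
     * Returns the local path of the jar, downloading it only if no other
     * caller on this machine has already done so. The method is synchronized
     * because file locks are held per JVM: threads inside one JVM must not
     * request overlapping locks concurrently.
     */
    public static synchronized Path getOrDownload(
            Path sharedRoot, String jobId, String jarName, JarDownloader downloader)
            throws IOException {
        // Assumed layout: <sharedRoot>/<jobId>/<jarName>
        Path jobDir = Files.createDirectories(sharedRoot.resolve(jobId));
        Path jar = jobDir.resolve(jarName);
        Path lockFile = jobDir.resolve(jarName + ".lock");

        try (FileChannel channel = FileChannel.open(
                     lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) { // blocks while another process holds the lock
            if (Files.exists(jar)) {
                return jar; // already localized by an earlier caller
            }
            // Download to a temporary file first so that a partially written
            // jar is never visible under the final name.
            Path tmp = Files.createTempFile(jobDir, jarName, ".part");
            try {
                downloader.downloadTo(tmp);
                Files.move(tmp, jar, StandardCopyOption.ATOMIC_MOVE);
            } finally {
                Files.deleteIfExists(tmp); // no-op after a successful move
            }
            return jar;
        }
    }
}
```

With this shape, the first caller for a given job pays the download cost and later callers on the same machine return as soon as they acquire the lock and see the file. Cleanup of the shared directory when the job finishes is deliberately left out of the sketch.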
--
This message was sent by Atlassian Jira
(v8.3.4#803005)