[
https://issues.apache.org/jira/browse/FLINK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-24293:
-----------------------------------
Labels: pull-request-available stale-major (was: pull-request-available)
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue has been marked as
Major but is unassigned and neither it nor its Sub-Tasks have been updated
for 60 days. I have gone ahead and added a "stale-major" label to the issue. If
this ticket is a Major, please either assign yourself or give an update.
Afterwards, please remove the label or in 7 days the issue will be deprioritized.
> Tasks from the same job on a machine share user jar
> ----------------------------------------------------
>
> Key: FLINK-24293
> URL: https://issues.apache.org/jira/browse/FLINK-24293
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: huntercc
> Priority: Major
> Labels: pull-request-available, stale-major
> Attachments: image-2021-09-15-20-43-11-758.png,
> image-2021-09-15-20-43-17-304.png
>
>
> In the current blob storage design, tasks executed by the same TaskExecutor
> share a BLOB storage dir, while tasks executed by different TaskExecutors use
> different dirs. As a result, a TaskExecutor has to download the user jar even
> if the same user jar has already been downloaded by other TaskExecutors on
> the machine. We believe there is no need to download many copies of the same
> user jar to one machine. Doing so exposes two main problems:
> # The NIC bandwidth of the distribution terminal may become a bottleneck
> !image-2021-09-15-20-43-17-304.png|width=695,height=193!
> As shown in the figure above, 24640 Mbps of the total 25000 Mbps NIC
> bandwidth was used when we launched a Flink job with 4000 TaskManagers, which
> causes long deployment times and Akka timeout exceptions.
> # More disk space is taken up
> We would like to optimize the sharing mechanism for the user jar by allowing
> tasks from the same job on a machine to share a blob storage dir and, more
> specifically, the user jar in that dir. Only one task deployed to the machine
> downloads the user jar from the BLOB server or distributed file storage, and
> subsequent tasks just use the localized user jar. In this way, the user jar
> of one job only needs to be downloaded once per machine. Here is a
> comparison of job startup time before and after the optimization.
> ||number of TMs||startup time before optimization||startup time after optimization||
> |1000|62s|37s|
> |2000|104s|40s|
> |3000|170s|43s|
> |4000|211s|45s|
>
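The "download once per machine" idea in the quoted description can be sketched as follows. This is only an illustration in Python, not Flink's implementation or the code in the linked pull request: the function name `localize_user_jar`, the shared per-job directory, and the `download` callback are all hypothetical, and the sketch assumes a POSIX system where `fcntl.flock` is available. The first caller to take the lock downloads the jar, and later callers find the localized copy and reuse it.

```python
import fcntl
import os
import tempfile
from pathlib import Path
from typing import Callable


def localize_user_jar(shared_dir: Path, jar_name: str,
                      download: Callable[[Path], None]) -> Path:
    """Return the local jar path, downloading at most once per shared_dir.

    `download` is a hypothetical callback that writes the jar to the given
    path (for example by fetching it from a BLOB server).
    """
    shared_dir.mkdir(parents=True, exist_ok=True)
    target = shared_dir / jar_name
    lock_path = shared_dir / (jar_name + ".lock")

    with open(lock_path, "w") as lock_file:
        # Exclusive advisory lock: blocks while another process holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if not target.exists():
                # Write to a temporary file first so that other readers
                # never observe a partially downloaded jar.
                fd, tmp_name = tempfile.mkstemp(dir=shared_dir, suffix=".tmp")
                os.close(fd)
                try:
                    download(Path(tmp_name))
                    # Atomic rename within the same directory.
                    os.replace(tmp_name, target)
                except BaseException:
                    if os.path.exists(tmp_name):
                        os.unlink(tmp_name)
                    raise
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target
```

In this sketch every TaskExecutor on a machine would call `localize_user_jar` with the same per-job directory, so the cost of the `download` callback is paid once per machine rather than once per TaskExecutor. Cleanup of the shared directory once the job finishes is deliberately left out.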
--
This message was sent by Atlassian Jira
(v8.20.1#820001)