[ 
https://issues.apache.org/jira/browse/FLINK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huntercc reopened FLINK-24293:
------------------------------

> Tasks from the same job on a machine share user jar 
> ----------------------------------------------------
>
>                 Key: FLINK-24293
>                 URL: https://issues.apache.org/jira/browse/FLINK-24293
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>         Attachments: image-2021-09-15-19-39-00-101.png, 
> image-2021-09-15-19-39-40-461.png
>
>
> In the current blob storage design, tasks executed by the same TaskExecutor 
> will share BLOBs storage dir and tasks executed by different TaskExecutor use 
> different dir. As a result, a TaskExecutor has to download user jar even if 
> there has been the same user jar downloaded by other TaskExecutors on the 
> machine. We believe that there is no need to download many copies of the same 
> user jar to the local, two main problems will by exposed:
>  # The NIC bandwidth of the distribution terminal may become a bottleneck 
> !image-2021-09-15-19-39-40-461.png|width=651,height=216! As shown in the 
> figure above, 24640 Mbps of the total 25000 Mbps NIC bandwidth is used when 
> we launched a flink job with 4000 TaskManagers, which will cause a long 
> deployment time and akka timeout exception.
>  # Take up more disk space
> We expect to optimize the sharing mechanism of user jar by allowing tasks 
> from the same job on a machine to share blob storage dir, more specifically, 
> share the user jar in the dir. Only one task deployed to the machine will 
> download the user jar from BLOB server or distributed file storage, and the 
> subsequent tasks just use the localized user jar. In this way, the user jar 
> of one job only needs to be downloaded once on a machine. Here is a 
> comparison of job startup time before and after optimization.
> ||num of TM||before optimization||after optimization||
> |1000|62s|37s|
> |2000|104s|40s|
> |3000|170s|43s|
> |4000|211s|45s|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to