[jira] [Updated] (FLINK-24293) Tasks from the same job on a machine share user jar

Flink Jira Bot (Jira) Sat, 27 Nov 2021 02:40:12 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Flink Jira Bot updated FLINK-24293:
-----------------------------------
    Labels: pull-request-available stale-major  (was: pull-request-available)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help 
the community manage its development. I see this issue has been marked as 
Major but is unassigned, and neither it nor its Sub-Tasks have been updated 
for 60 days. I have gone ahead and added a "stale-major" label to the issue. If 
this ticket is a Major, please either assign yourself or give an update. 
Afterwards, please remove the label or in 7 days the issue will be deprioritized.


> Tasks from the same job on a machine share user jar 
> ----------------------------------------------------
>
>                 Key: FLINK-24293
>                 URL: https://issues.apache.org/jira/browse/FLINK-24293
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>              Labels: pull-request-available, stale-major
>         Attachments: image-2021-09-15-20-43-11-758.png, 
> image-2021-09-15-20-43-17-304.png
>
>
> In the current blob storage design, tasks executed by the same TaskExecutor 
> share a BLOB storage directory, while tasks executed by different 
> TaskExecutors use different directories. As a result, a TaskExecutor has to 
> download the user jar even if other TaskExecutors on the same machine have 
> already downloaded it. We believe there is no need to download multiple 
> copies of the same user jar to one machine. Doing so exposes two main problems:
>  # The NIC bandwidth of the distributing side may become a bottleneck.  
> !image-2021-09-15-20-43-17-304.png|width=695,height=193! 
> As shown in the figure above, 24640 Mbps of the total 25000 Mbps NIC 
> bandwidth (about 98.6%) was used when we launched a Flink job with 4000 
> TaskManagers, which leads to long deployment times and Akka timeout exceptions.
>  # The duplicate copies take up more disk space.
> We propose to optimize the sharing mechanism for the user jar by allowing 
> tasks from the same job on a machine to share a blob storage directory and, 
> more specifically, the user jar in that directory. Only one task deployed to 
> the machine downloads the user jar from the BLOB server or distributed file 
> storage, and subsequent tasks simply use the localized copy. In this way, the 
> user jar of a job needs to be downloaded only once per machine. The following 
> table compares job startup time before and after the optimization.
> ||Number of TaskManagers||Before optimization||After optimization||
> |1000|62s|37s|
> |2000|104s|40s|
> |3000|170s|43s|
> |4000|211s|45s|
>  
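The "download once per machine" idea in the description can be illustrated with a minimal sketch. This is not the implementation from the linked pull request, and it does not use any Flink API: the class `SharedJarLocalizer`, the `Downloader` callback, and the directory layout (`sharedDir/jobId/jarName`) are all hypothetical names chosen for illustration. The sketch assumes that TaskExecutor processes on the same machine can see a common local directory, and it uses an OS-level file lock so that only the first caller downloads while later callers reuse the localized file.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not part of Flink): localizes a job's user jar once per machine. */
final class SharedJarLocalizer {

    /** Hypothetical callback that fetches the jar, e.g. from a BLOB server, into the given file. */
    @FunctionalInterface
    interface Downloader {
        void downloadTo(Path target) throws IOException;
    }

    private SharedJarLocalizer() {}

    /** Returns the path of the localized jar, downloading it only if no copy exists yet. */
    static Path localize(Path sharedDir, String jobId, String jarName, Downloader downloader)
            throws IOException {
        Path jobDir = Files.createDirectories(sharedDir.resolve(jobId));
        Path jar = jobDir.resolve(jarName);
        Path lockFile = jobDir.resolve(jarName + ".lock");

        // The monitor serializes threads inside this JVM; the file lock serializes
        // other processes on the same machine that use the same shared directory.
        synchronized (SharedJarLocalizer.class) {
            try (FileChannel channel = FileChannel.open(
                         lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                 FileLock lock = channel.lock()) {
                if (!Files.exists(jar)) {
                    // Download to a temporary file first, then move it into place
                    // atomically so that no reader ever sees a partial jar.
                    Path tmp = Files.createTempFile(jobDir, jarName, ".part");
                    downloader.downloadTo(tmp);
                    Files.move(tmp, jar, StandardCopyOption.ATOMIC_MOVE);
                }
            }
        }
        return jar;
    }
}
```

In this sketch, calling `localize` twice with the same job ID and jar name invokes the downloader only for the first call. A real implementation would also need to handle cleanup when the job finishes and verify the content of the localized file, both of which are left out here.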



--
This message was sent by Atlassian Jira
(v8.20.1#820001)