[jira] [Updated] (FLINK-24293) Tasks from the same job on a machine share user jar

huntercc (Jira) Wed, 15 Sep 2021 19:39:04 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


huntercc updated FLINK-24293:
-----------------------------
    Description: 
In the current BLOB storage design, tasks executed by the same TaskExecutor 
share a BLOB storage directory, while tasks executed by different 
TaskExecutors use different directories. As a result, a TaskExecutor has to 
download the user jar even if other TaskExecutors on the same machine have 
already downloaded it. We believe there is no need to download multiple copies 
of the same user jar to one machine. Doing so causes two main problems:
 # The NIC bandwidth on the distributing side may become a bottleneck.
!image-2021-09-15-20-43-17-304.png|width=695,height=193! 
As shown in the figure above, 24640 Mbps of the total 25000 Mbps NIC bandwidth 
was used when we launched a Flink job with 4000 TaskManagers, which leads to a 
long deployment time and Akka timeout exceptions.
 # The duplicate copies take up more disk space.
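
To illustrate why the NIC saturates, here is a rough back-of-the-envelope 
sketch. Only the 25000 Mbps NIC figure and the 4000 TaskManagers come from this 
issue. The jar size (100 MB) and the machine count (500) are assumptions chosen 
purely for illustration, and the class and method names are hypothetical.

```java
final class JarDistributionEstimate {

    /**
     * Lower bound, in seconds, on the time needed to push `copies` copies of a jar
     * through a single NIC. Ignores protocol overhead, disk speed and contention.
     */
    static double minTransferSeconds(long copies, double jarSizeMegabytes, double nicMbps) {
        double totalMegabits = copies * jarSizeMegabytes * 8.0; // 8 bits per byte
        return totalMegabits / nicMbps;
    }

    public static void main(String[] args) {
        double nicMbps = 25_000;   // from the issue: total NIC bandwidth
        long taskManagers = 4_000; // from the issue: TaskManagers in the job
        double jarMegabytes = 100; // ASSUMPTION: the jar size is not stated in the issue
        long machines = 500;       // ASSUMPTION: the machine count is not stated in the issue

        // One download per TaskManager (current behaviour).
        System.out.printf("per TaskManager: %.1f s%n",
                minTransferSeconds(taskManagers, jarMegabytes, nicMbps));
        // One download per machine (proposed behaviour).
        System.out.printf("per machine:     %.1f s%n",
                minTransferSeconds(machines, jarMegabytes, nicMbps));
    }
}
```

The point of the sketch is only that the transfer time through one NIC grows 
linearly with the number of copies requested, so reducing the copies from one 
per TaskManager to one per machine shrinks the bound by the same factor.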

We propose optimizing the sharing mechanism for user jars by allowing tasks 
from the same job on a machine to share a BLOB storage directory and, more 
specifically, the user jar in that directory. Only one task deployed to the 
machine downloads the user jar from the BLOB server or distributed file 
storage; subsequent tasks simply use the localized copy. In this way, the user 
jar of a job needs to be downloaded only once per machine. The following table 
compares job startup time before and after the optimization.
||Number of TaskManagers||Startup time before optimization||Startup time after optimization||
|1000|62s|37s|
|2000|104s|40s|
|3000|170s|43s|
|4000|211s|45s|
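
The "download once per machine" idea described above could be sketched roughly 
as follows. This is not the implementation from the pull request and not 
Flink's BLOB cache API. The class SharedJarCache, the method getOrDownload and 
the directory layout are hypothetical names used only for illustration. The 
sketch assumes all TaskExecutors on a machine can see one common directory and 
uses an OS-level file lock so that only the first caller downloads the jar.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Callable;

/** Hypothetical per-machine jar cache shared by all TaskExecutor processes. */
final class SharedJarCache {
    private final Path machineDir; // assumed: a directory visible to every TaskExecutor on the machine

    SharedJarCache(Path machineDir) {
        this.machineDir = machineDir;
    }

    /**
     * Returns the local jar path, downloading it only if no other process has done so.
     * FileLock coordinates separate processes. Threads inside one JVM must not call
     * this concurrently for the same key, because overlapping locks in one JVM throw
     * OverlappingFileLockException.
     */
    Path getOrDownload(String jobId, String blobKey, Callable<InputStream> download)
            throws Exception {
        Path jobDir = Files.createDirectories(machineDir.resolve(jobId));
        Path jar = jobDir.resolve(blobKey + ".jar");
        Path lockFile = jobDir.resolve(blobKey + ".lock");

        try (FileChannel channel = FileChannel.open(
                     lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) { // blocks until other processes release the lock
            if (Files.exists(jar)) {
                return jar; // another task already localized the jar
            }
            // Download to a temp file first so readers never see a partial jar.
            Path tmp = Files.createTempFile(jobDir, blobKey, ".tmp");
            try (InputStream in = download.call()) {
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                Files.move(tmp, jar, StandardCopyOption.ATOMIC_MOVE);
            } finally {
                Files.deleteIfExists(tmp); // no-op after a successful move
            }
            return jar;
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("shared-jars");
        SharedJarCache cache = new SharedJarCache(dir);
        // Stand-in for fetching from the BLOB server or distributed file storage.
        Callable<InputStream> fakeDownload = () -> new ByteArrayInputStream(new byte[] {1, 2, 3});
        System.out.println(cache.getOrDownload("job-1", "user-jar", fakeDownload));
        System.out.println(cache.getOrDownload("job-1", "user-jar", fakeDownload));
    }
}
```

A real implementation would also need to handle cleanup when the last task of 
the job leaves the machine, which this sketch leaves out.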

 

  was:
In the current blob storage design, tasks executed by the same TaskExecutor 
will share BLOBs storage dir and tasks executed by different TaskExecutor use 
different dir. As a result, a TaskExecutor has to download user jar even if 
there has been the same user jar downloaded by other TaskExecutors on the 
machine. We believe that there is no need to download many copies of the same 
user jar to the local, two main problems will by exposed:
 # The NIC bandwidth of the distribution terminal may become a bottlenec 
!image-2021-09-15-20-43-17-304.png|width=695,height=193! As shown in the figure 
above, 24640 Mbps of the total 25000 Mbps NIC bandwidth is used when we 
launched a flink job with 4000 TaskManagers, which will cause a long deployment 
time and akka timeout exception.
 # Take up more disk space

We expect to optimize the sharing mechanism of user jar by allowing tasks from 
the same job on a machine to share blob storage dir, more specifically, share 
the user jar in the dir. Only one task deployed to the machine will download 
the user jar from BLOB server or distributed file storage, and the subsequent 
tasks just use the localized user jar. In this way, the user jar of one job 
only needs to be downloaded once on a machine. Here is a comparison of job 
startup time before and after optimization.
||num of TM||before optimization||after optimization||
|1000|62s|37s|
|2000|104s|40s|
|3000|170s|43s|
|4000|211s|45s|

 


> Tasks from the same job on a machine share user jar 
> ----------------------------------------------------
>
>                 Key: FLINK-24293
>                 URL: https://issues.apache.org/jira/browse/FLINK-24293
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-09-15-20-43-11-758.png, 
> image-2021-09-15-20-43-17-304.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
