[jira] [Commented] (FLINK-32484) AdaptiveScheduler combined restart during scaling out

Prabhu Joseph (Jira) Thu, 29 Jun 2023 05:10:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-32484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738530#comment-17738530
 ]


Prabhu Joseph commented on FLINK-32484:
---------------------------------------

[~gyfora] If you are fine with this idea, could you assign this ticket to me? I 
can work on this and come up with a patch.

> AdaptiveScheduler combined restart during scaling out
> -----------------------------------------------------
>
>                 Key: FLINK-32484
>                 URL: https://issues.apache.org/jira/browse/FLINK-32484
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Core
>    Affects Versions: 1.17.0
>            Reporter: Prabhu Joseph
>            Priority: Major
>
> On a scaling-out operation, when nodes are added at different times, 
> AdaptiveScheduler does multiple restarts within a short period of time. On 
> one of our Flink jobs, we have seen AdaptiveScheduler restart the 
> ExecutionGraph every time there is a notification of new resources to it. 
> There are five restarts within 3 minutes.
> AdaptiveScheduler could provide a configurable restart window interval to the 
> user during which it combines the notified resources and restarts once when 
> the available resources are sufficient to fit the desired parallelism or when 
> the window times out. The window is created during the first notification of 
> resources received. This is applicable only when the execution graph is in 
> the executing state and not in the waiting for resources state.
>  
> {code:java}
> [root@ip-1-2-3-4 container_1688034805200_0002_01_000001]# grep -i scale *
> jobmanager.log:2023-06-29 10:46:58,061 INFO  
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New 
> resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:47:57,317 INFO  
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New 
> resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:48:53,314 INFO  
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New 
> resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:49:27,821 INFO  
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New 
> resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:50:15,672 INFO  
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New 
> resources are available. Restarting job to scale up.
> [root@ip-1-2-3-4 container_1688034805200_0002_01_000001]# {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-32484) AdaptiveScheduler combined restart during scaling out

Reply via email to