[jira] [Updated] (FLINK-32484) AdaptiveScheduler combined restart during scaling out

Prabhu Joseph (Jira) Thu, 29 Jun 2023 04:25:04 -0700


     [ https://issues.apache.org/jira/browse/FLINK-32484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Prabhu Joseph updated FLINK-32484:
----------------------------------
    Description: 
During a scale-out operation, when nodes are added at different times, 
AdaptiveScheduler restarts the job multiple times within a short period. On one 
of our Flink jobs, we have seen AdaptiveScheduler restart the ExecutionGraph 
every time it is notified of new resources. The log below shows five restarts 
in just over 3 minutes.

AdaptiveScheduler could offer a configurable restart window during which it 
accumulates the notified resources and restarts only once: either when the 
available resources are sufficient to fit the desired parallelism or when the 
window times out. The window opens on the first resource notification 
received. This applies only when the execution graph is in the executing 
state, not in the waiting-for-resources state.
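
To make the proposed behaviour concrete, here is a minimal sketch of such a restart window. It is not Flink code: the class RestartWindow, its method onNewResources, and the slot counts used in main are all hypothetical, and the arrival offsets are the log timestamps from this issue rounded to whole seconds. It assumes the scheduler can compare the currently available slots against the slots needed for the desired parallelism.

```java
import java.time.Duration;

/** Hypothetical sketch of a restart window; not part of Flink's AdaptiveScheduler API. */
final class RestartWindow {
    private final int desiredSlots;   // slots needed for the desired parallelism (assumed known)
    private final long windowMillis;  // configurable window length
    private long windowStart = -1;    // -1 means no window is currently open

    RestartWindow(int desiredSlots, Duration window) {
        this.desiredSlots = desiredSlots;
        this.windowMillis = window.toMillis();
    }

    /**
     * Called on every notification of new resources while the job is executing.
     * Returns true if the job should be restarted now.
     */
    boolean onNewResources(int availableSlots, long nowMillis) {
        if (windowStart < 0) {
            windowStart = nowMillis; // the first notification opens the window
        }
        boolean sufficient = availableSlots >= desiredSlots;
        boolean timedOut = nowMillis - windowStart >= windowMillis;
        if (sufficient || timedOut) {
            windowStart = -1; // close the window; the next notification opens a new one
            return true;
        }
        return false; // keep combining notifications
    }

    public static void main(String[] args) {
        // Hypothetical scenario: 10 slots desired, 5-minute window.
        RestartWindow window = new RestartWindow(10, Duration.ofMinutes(5));
        long[] arrivalSeconds = {0, 59, 115, 150, 198}; // rounded from the log in this issue
        int[] availableSlots = {2, 4, 6, 8, 10};        // assumed, not taken from the log
        int restarts = 0;
        for (int i = 0; i < arrivalSeconds.length; i++) {
            if (window.onNewResources(availableSlots[i], arrivalSeconds[i] * 1000L)) {
                restarts++;
            }
        }
        System.out.println("restarts = " + restarts); // prints "restarts = 1"
    }
}
```

Under these assumed numbers the five notifications lead to a single restart instead of five. The sketch only evaluates the timeout when a notification arrives; a real implementation would presumably also need a timer so that the restart fires when the window expires without further notifications.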

 
{code:java}
[root@ip-1-2-3-4 container_1688034805200_0002_01_000001]# grep -i scale *
jobmanager.log:2023-06-29 10:46:58,061 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:47:57,317 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:48:53,314 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:49:27,821 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:50:15,672 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
[root@ip-1-2-3-4 container_1688034805200_0002_01_000001]#
{code}
 

  was:
On a scaling-out operation, when nodes are added at different times, 
AdaptiveScheduler does multiple restarts within a short period of time. On one 
of our Flink jobs, we have seen AdaptiveScheduler restart the ExecutionGraph 
every time there is a notification of new resources to it. There are five 
restarts within 3 minutes.

AdaptiveScheduler could provide a configurable restart window interval to the 
user during which it combines the notified resources and restarts once when the 
available resources are sufficient to fit the desired parallelism or when the 
window times out. The window is created during the first notification of 
resources received. This is applicable only when the execution graph is in the 
executing state and not in the waiting for resources state.

 
{code:java}
[root@ip-172-31-40-185 container_1688034805200_0002_01_000001]# grep -i scale *
jobmanager.log:2023-06-29 10:46:58,061 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:47:57,317 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:48:53,314 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:49:27,821 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
jobmanager.log:2023-06-29 10:50:15,672 INFO  org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
[root@ip-172-31-40-185 container_1688034805200_0002_01_000001]#
{code}
 


> AdaptiveScheduler combined restart during scaling out
> -----------------------------------------------------
>
>                 Key: FLINK-32484
>                 URL: https://issues.apache.org/jira/browse/FLINK-32484
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Core
>    Affects Versions: 1.17.0
>            Reporter: Prabhu Joseph
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
