[
https://issues.apache.org/jira/browse/FLINK-32484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738530#comment-17738530
]
Prabhu Joseph commented on FLINK-32484:
---------------------------------------
[~gyfora] If you are fine with this idea, could you assign this ticket to me? I
can work on this and come up with a patch.
> AdaptiveScheduler combined restart during scaling out
> -----------------------------------------------------
>
> Key: FLINK-32484
> URL: https://issues.apache.org/jira/browse/FLINK-32484
> Project: Flink
> Issue Type: Improvement
> Components: API / Core
> Affects Versions: 1.17.0
> Reporter: Prabhu Joseph
> Priority: Major
>
> During a scale-out operation, when nodes are added at different times,
> AdaptiveScheduler restarts the job multiple times within a short period. On
> one of our Flink jobs, we have seen AdaptiveScheduler restart the
> ExecutionGraph every time it is notified of new resources. The log below
> shows five restarts in a little over three minutes.
> AdaptiveScheduler could offer a user-configurable restart window during
> which it accumulates the notified resources and restarts only once: either
> when the available resources are sufficient for the desired parallelism or
> when the window times out. The window opens when the first resource
> notification is received. This applies only while the execution graph is in
> the Executing state, not in the WaitingForResources state.
>
> {code}
> [root@ip-1-2-3-4 container_1688034805200_0002_01_000001]# grep -i scale *
> jobmanager.log:2023-06-29 10:46:58,061 INFO
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New
> resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:47:57,317 INFO
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New
> resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:48:53,314 INFO
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New
> resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:49:27,821 INFO
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New
> resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:50:15,672 INFO
> org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New
> resources are available. Restarting job to scale up.
> [root@ip-1-2-3-4 container_1688034805200_0002_01_000001]# {code}
>
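The restart window proposed in the description could be sketched roughly as follows. This is only an illustration of the idea, not Flink code: the class `RestartWindow`, its methods, and the explicit `nowMillis` parameter are all hypothetical names chosen for this sketch, and the caller is assumed to invoke it only while the job is in the Executing state.

```java
import java.time.Duration;

/**
 * Hypothetical sketch of a combined-restart window.
 * Not part of Flink's AdaptiveScheduler; all names here are assumptions.
 */
final class RestartWindow {
    private final int desiredParallelism;
    private final long windowMillis;
    private long windowStart = -1L; // -1 means no window is currently open

    RestartWindow(int desiredParallelism, Duration window) {
        this.desiredParallelism = desiredParallelism;
        this.windowMillis = window.toMillis();
    }

    /**
     * Called on each "new resources available" notification.
     * Opens the window on the first notification, then decides
     * whether a restart should happen now.
     */
    boolean onNewResources(int availableSlots, long nowMillis) {
        if (windowStart < 0) {
            windowStart = nowMillis;
        }
        return shouldRestart(availableSlots, nowMillis);
    }

    /**
     * Can also be called from a timer to detect window expiry.
     * Returns true (and closes the window) when the resources fit the
     * desired parallelism or the window has timed out.
     */
    boolean shouldRestart(int availableSlots, long nowMillis) {
        if (windowStart < 0) {
            return false; // nothing pending, so nothing to restart for
        }
        boolean sufficient = availableSlots >= desiredParallelism;
        boolean timedOut = nowMillis - windowStart >= windowMillis;
        if (sufficient || timedOut) {
            windowStart = -1L; // close the window; the next notification opens a new one
            return true;
        }
        return false;
    }
}
```

With a three-minute window and a desired parallelism of 8, notifications for 4 and then 6 slots would be absorbed, and a single restart would be triggered once 8 slots are reported or the window expires, whichever comes first.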