[
https://issues.apache.org/jira/browse/FLINK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812796#comment-17812796
]
David Morávek commented on FLINK-33092:
---------------------------------------
This is something we've discussed internally as well. I think overall we should
take a bigger stab at the whole AdaptiveScheduler configuration to make the
whole story more coherent.
* We can probably get rid of RescalingController (it exists for historical
reasons)
* Redefine how cooldown periods work (they shouldn't restart every time
something changes), especially in combination with bringing
_resource-stabilization-timeout_ into the Executing state.
* Allow rescaling to wait for the next completed checkpoint (this is what
triggered us to look into the whole configuration story)
I've also already done some work in this direction; it would be great to align
on it if you have some time this week (the first steps around fixing the
cooldown periods and consolidating AS settings are already merged).
> Improve the resource-stabilization-timeout mechanism when rescale a job for
> Adaptive Scheduler
> ----------------------------------------------------------------------------------------------
>
> Key: FLINK-33092
> URL: https://issues.apache.org/jira/browse/FLINK-33092
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
> Attachments: image-2023-09-15-14-43-35-104.png
>
>
> !image-2023-09-15-14-43-35-104.png|width=916,height=647!
> h1. 1. Proposal
> The above is the state transition graph for rescaling a job in the Adaptive
> Scheduler.
> In brief, when we trigger a rescale, the job waits for
> _*resource-stabilization-timeout*_ in the WaitingForResources state when it
> has sufficient resources but not the desired resources.
> If the _*resource-stabilization-timeout mechanism*_ is moved into the
> Executing state, the rescale downtime will be significantly reduced.
> h1. 2. Why is the downtime long?
> Currently, when rescaling a job:
> * Executing transitions to Restarting.
> * Restarting cancels the job first.
> * Restarting transitions to WaitingForResources after the whole job has
> reached a terminal state.
> * When the job has sufficient resources but not the desired resources,
> WaitingForResources needs to wait for
> _*resource-stabilization-timeout*_.
> * WaitingForResources transitions to CreatingExecutionGraph after
> resource-stabilization-timeout.
> The problem is that the job isn't running during the
> resource-stabilization-timeout phase.
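To make the cost of the current flow concrete, here is a rough sketch of the worst-case downtime as a sum of the phases listed above. It is a toy model and not Flink code: the class, the function, and the cancel and redeploy durations are hypothetical, and it assumes that no wait is needed once the desired resources are available.

```python
from dataclasses import dataclass


@dataclass
class RescaleTimings:
    """Hypothetical phase durations in seconds (illustrative, not Flink API)."""
    cancel_seconds: float          # Restarting: cancel the job until it is terminal
    stabilization_timeout: float   # resource-stabilization-timeout
    redeploy_seconds: float        # CreatingExecutionGraph plus redeployment


def current_rescale_downtime(t: RescaleTimings, has_desired_resources: bool) -> float:
    """Worst-case downtime for Executing -> Restarting -> WaitingForResources
    -> CreatingExecutionGraph, where the wait happens after the job is cancelled."""
    # Assumption: with the desired resources available there is nothing to wait for.
    wait = 0.0 if has_desired_resources else t.stabilization_timeout
    return t.cancel_seconds + wait + t.redeploy_seconds


timings = RescaleTimings(cancel_seconds=2.0, stabilization_timeout=10.0, redeploy_seconds=3.0)
print(current_rescale_downtime(timings, has_desired_resources=False))
print(current_rescale_downtime(timings, has_desired_resources=True))
```

With only sufficient resources, the whole stabilization timeout counts as downtime in this model, because the job has already been cancelled when the wait starts.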
> h1. 3. How to reduce the downtime?
> We can move the _*resource-stabilization-timeout mechanism*_ into the
> Executing state when a rescale is triggered. This means:
> * When the job has the desired resources, Executing can rescale directly.
> * When the job has sufficient resources but not the desired resources, we
> can rescale after _*resource-stabilization-timeout*_.
> * WaitingForResources will ignore the resource-stabilization-timeout
> after this improvement.
> Because the resource-stabilization-timeout then elapses before the job is
> cancelled, the rescale downtime will be significantly reduced.
>
> Note: the resource-stabilization-timeout still applies in WaitingForResources
> when starting a job. Only the rescale path changes.
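The proposed rule for the Executing state could be sketched roughly as follows. This is an illustration of the decision described in the issue, not the actual AdaptiveScheduler implementation: the enum, the function, and its parameters are hypothetical, and it assumes the timeout is measured from the moment the rescale is triggered.

```python
from enum import Enum


class Action(Enum):
    RESCALE_NOW = "rescale now"
    KEEP_RUNNING = "keep running"


def executing_rescale_decision(
    has_desired_resources: bool,
    has_sufficient_resources: bool,
    elapsed_seconds: float,
    stabilization_timeout_seconds: float,
) -> Action:
    """Decide, while the job is still running, whether to start a rescale."""
    if has_desired_resources:
        # Desired resources are available: rescale directly, no waiting.
        return Action.RESCALE_NOW
    if has_sufficient_resources and elapsed_seconds >= stabilization_timeout_seconds:
        # Only sufficient resources: rescale once the timeout has elapsed.
        return Action.RESCALE_NOW
    # Otherwise the job keeps running, so the wait causes no downtime.
    return Action.KEEP_RUNNING


print(executing_rescale_decision(True, True, 0.0, 10.0))
print(executing_rescale_decision(False, True, 4.0, 10.0))
print(executing_rescale_decision(False, True, 10.0, 10.0))
```

The point of the sketch is that every branch is evaluated before the job is cancelled, so the stabilization wait overlaps with normal processing instead of adding to the downtime.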
--
This message was sent by Atlassian Jira
(v8.20.10#820010)