[jira] [Commented] (FLINK-33092) Improve the resource-stabilization-timeout mechanism when rescale a job for Adaptive Scheduler

Jira Wed, 31 Jan 2024 08:43:05 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812796#comment-17812796
 ]


David Morávek commented on FLINK-33092:
---------------------------------------

This is something we've discussed internally as well. I think overall we should 
take a bigger stab at the whole AdaptiveScheduler configuration to make the 
whole story more well-rounded.

 
 * We can probably get rid of RescalingController (it exists for historical 
reasons)
 * Redefine how cooldown periods work (they shouldn't restart every time 
something changes), especially in combination with bringing 
_resource-stabilization-timeout_ into the Executing state.
 * Allow rescaling to wait for the next completed checkpoint (this is what 
triggered us to look into the whole configuration story)

 

I've also already done some work in this direction; it would be great to align 
on it if you have some time this week (the first steps around fixing the 
cooldown periods and consolidating AS settings are already merged).

 

 

> Improve the resource-stabilization-timeout mechanism when rescale a job for 
> Adaptive Scheduler
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33092
>                 URL: https://issues.apache.org/jira/browse/FLINK-33092
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>         Attachments: image-2023-09-15-14-43-35-104.png
>
>
> !image-2023-09-15-14-43-35-104.png|width=916,height=647!
> h1. 1. Proposal
> The above is the state transition graph for rescaling a job in the Adaptive 
> Scheduler.
> In brief, when we trigger a rescale, the job waits for 
> _*resource-stabilization-timeout*_ in the WaitingForResources state when it 
> has sufficient resources but not the desired resources.
> If the _*resource-stabilization-timeout mechanism*_ is moved into the 
> Executing state, the rescale downtime will be significantly reduced.
> h1. 2. Why is the downtime long?
> Currently, when rescaling a job:
>  * Executing transitions to Restarting.
>  * Restarting cancels the job first.
>  * Restarting transitions to WaitingForResources once the whole job has 
> reached a terminal state.
>  * When the job has sufficient resources but not the desired resources, 
> WaitingForResources needs to wait for _*resource-stabilization-timeout*_.
>  * WaitingForResources transitions to CreatingExecutionGraph after 
> resource-stabilization-timeout.
> The problem is that the job isn't running during the 
> resource-stabilization-timeout phase.
> h1. 3. How to reduce the downtime?
> We can move the _*resource-stabilization-timeout mechanism*_ into the 
> Executing state when a rescale is triggered. This means:
>  * When the job has the desired resources, Executing can rescale directly.
>  * When the job has sufficient resources but not the desired resources, we 
> can rescale after _*resource-stabilization-timeout*_.
>  * When rescaling, WaitingForResources will no longer wait for the 
> resource-stabilization-timeout after this improvement.
> The resource-stabilization-timeout then elapses before the job is cancelled, 
> so the rescale downtime will be significantly reduced.
>  
> Note: the resource-stabilization-timeout still applies in WaitingForResources 
> when starting a job. Only the rescale path changes.
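
For reference, the decision proposed in section 3 of the quoted description can be sketched roughly as below. This is only an illustration under assumptions: RescaleSketch, Action, decide and the slot-count parameters are hypothetical names, not Flink classes or the actual AdaptiveScheduler code, and "resources" is simplified to a plain slot count. The two downtime helpers only restate the argument from section 2: today the stabilization wait happens after the job is cancelled, whereas in the proposal it happens while the job is still running.

```java
import java.time.Duration;

/** Hypothetical sketch of the proposal; none of these names exist in Flink. */
final class RescaleSketch {

    /** What the Executing state would do when a rescale is triggered. */
    enum Action { RESCALE_NOW, WAIT_FOR_STABILIZATION, KEEP_RUNNING }

    /**
     * availableSlots, sufficientSlots and desiredSlots are assumed slot counts.
     * sinceLastResourceChange is how long the available resources have been stable.
     */
    static Action decide(
            int availableSlots,
            int sufficientSlots,
            int desiredSlots,
            Duration sinceLastResourceChange,
            Duration stabilizationTimeout) {
        if (availableSlots >= desiredSlots) {
            // Desired resources are present: rescale directly.
            return Action.RESCALE_NOW;
        }
        if (availableSlots >= sufficientSlots) {
            // Sufficient but not desired: rescale only once the timeout has elapsed.
            return sinceLastResourceChange.compareTo(stabilizationTimeout) >= 0
                    ? Action.RESCALE_NOW
                    : Action.WAIT_FOR_STABILIZATION;
        }
        // Not even sufficient resources: the job simply keeps running as it is.
        return Action.KEEP_RUNNING;
    }

    /** Current flow: the wait happens in WaitingForResources, after cancellation. */
    static Duration downtimeCurrent(Duration cancelAndRedeploy, Duration stabilizationWait) {
        return cancelAndRedeploy.plus(stabilizationWait);
    }

    /** Proposed flow: the wait happens in Executing, so it adds no downtime. */
    static Duration downtimeProposed(Duration cancelAndRedeploy) {
        return cancelAndRedeploy;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(10);
        System.out.println(decide(4, 2, 4, Duration.ZERO, timeout));
        System.out.println(decide(3, 2, 4, Duration.ofSeconds(3), timeout));
        System.out.println(decide(3, 2, 4, Duration.ofSeconds(12), timeout));
        System.out.println(downtimeCurrent(Duration.ofSeconds(5), timeout));
        System.out.println(downtimeProposed(Duration.ofSeconds(5)));
    }
}
```

The sketch keeps the decision as a pure function so the three cases from the bullet list map one-to-one onto branches; a real implementation would presumably be driven by timers and resource-change callbacks instead.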



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
