[
https://issues.apache.org/jira/browse/FLINK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rui Fan updated FLINK-33092:
----------------------------
Description:
!image-2023-09-15-14-43-35-104.png|width=916,height=647!
h1. 1. Proposal
The above is the state transition graph for rescaling a job in the Adaptive
Scheduler.
In brief, when we trigger a rescale, the job waits for the
_*resource-stabilization-timeout*_ in the WaitingForResources state if it has
sufficient resources but not the desired resources.
If the _*resource-stabilization-timeout mechanism*_ is moved into the Executing
state, the rescale downtime will be significantly reduced, because the job
keeps running while it waits.
h1. 2. Why is the downtime long?
Currently, when rescaling a job:
* Executing transitions to Restarting.
* Restarting cancels the job first.
* Restarting transitions to WaitingForResources after the whole job has
reached a terminal state.
* If the job has sufficient resources but not the desired resources,
WaitingForResources waits for the _*resource-stabilization-timeout*_.
* WaitingForResources transitions to CreatingExecutionGraph after the
resource-stabilization-timeout.
The problem is that the job isn't running during the
resource-stabilization-timeout phase.
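The steps above can be sketched as a small downtime calculation. This is only an illustrative model, not Flink code: the class name, method name, and parameters are hypothetical, and it assumes that WaitingForResources skips the wait when the desired resources are already available.

```java
import java.time.Duration;

/** Hypothetical model of the current rescale flow; not part of Flink. */
final class CurrentRescaleDowntimeSketch {

    /**
     * Downtime of one rescale in the current flow. The job is already
     * cancelled when WaitingForResources starts, so the stabilization
     * wait counts fully as downtime.
     */
    static Duration currentDowntime(
            Duration cancelJob,
            Duration stabilizationTimeout,
            Duration createAndDeployGraph,
            boolean hasDesiredResources) {
        // Assumption: with the desired resources available there is no wait.
        Duration wait = hasDesiredResources ? Duration.ZERO : stabilizationTimeout;
        return cancelJob.plus(wait).plus(createAndDeployGraph);
    }

    public static void main(String[] args) {
        Duration cancel = Duration.ofSeconds(2);
        Duration timeout = Duration.ofSeconds(10);
        Duration deploy = Duration.ofSeconds(5);
        // Sufficient but not desired resources: the timeout is part of the downtime.
        System.out.println(currentDowntime(cancel, timeout, deploy, false));
        // Desired resources available: only cancel and redeploy remain.
        System.out.println(currentDowntime(cancel, timeout, deploy, true));
    }
}
```

The durations in main are made-up sample values; the point is only that the timeout is added on top of the cancel and redeploy time whenever the job has sufficient but not desired resources.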
h1. 3. How to reduce the downtime?
We can move the _*resource-stabilization-timeout mechanism*_ into the Executing
state when a rescale is triggered. This means:
* If the job has the desired resources, Executing can rescale directly.
* If the job has sufficient resources but not the desired resources, we can
rescale after the _*resource-stabilization-timeout*_.
* WaitingForResources will ignore the resource-stabilization-timeout after
this improvement.
The resource-stabilization-timeout takes effect before the job is cancelled,
so the rescale downtime will be significantly reduced.
Note: the resource-stabilization-timeout still applies in WaitingForResources
when a job is started. The behavior changes only when a job is rescaled.
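The decision that the Executing state would make under this proposal could look roughly like the sketch below. It is a minimal illustration, not the actual implementation: the class, the enum, and all parameter names are hypothetical, and the case where the job lacks even sufficient resources is an assumption (the job simply keeps running).

```java
import java.time.Duration;

/** Hypothetical decision logic for the proposed flow; not part of Flink. */
final class ExecutingRescaleSketch {

    enum Decision { RESCALE_NOW, KEEP_RUNNING_AND_WAIT, KEEP_RUNNING }

    /**
     * Decides what Executing does after a rescale is triggered.
     * The job stays running in every branch except RESCALE_NOW.
     */
    static Decision decide(
            boolean hasDesiredResources,
            boolean hasSufficientResources,
            Duration waitedSoFar,
            Duration stabilizationTimeout) {
        if (hasDesiredResources) {
            // Desired resources are available: rescale directly.
            return Decision.RESCALE_NOW;
        }
        if (!hasSufficientResources) {
            // Assumption: nothing to rescale to, so keep the job running.
            return Decision.KEEP_RUNNING;
        }
        // Sufficient but not desired: rescale only once the timeout has passed.
        return waitedSoFar.compareTo(stabilizationTimeout) >= 0
                ? Decision.RESCALE_NOW
                : Decision.KEEP_RUNNING_AND_WAIT;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(10);
        System.out.println(decide(true, true, Duration.ZERO, timeout));
        System.out.println(decide(false, true, Duration.ofSeconds(3), timeout));
        System.out.println(decide(false, true, Duration.ofSeconds(10), timeout));
    }
}
```

Because the wait happens while the job is still executing, the downtime in this model consists only of the cancel and redeploy time, regardless of how long the stabilization timeout is.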
> Improve the resource-stabilization-timeout mechanism when rescale a job for
> Adaptive Scheduler
> ----------------------------------------------------------------------------------------------
>
> Key: FLINK-33092
> URL: https://issues.apache.org/jira/browse/FLINK-33092
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
> Attachments: image-2023-09-15-14-43-35-104.png
--
This message was sent by Atlassian Jira
(v8.20.10#820010)