[jira] [Closed] (FLINK-34318) AdaptiveScheduler resource stabilisation should happen before the job is cancelled

Gyula Fora (Jira) Wed, 31 Jan 2024 02:07:05 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-34318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gyula Fora closed FLINK-34318.
------------------------------
    Resolution: Fixed

Closing this as a duplicate, thanks [~fanrui]

> AdaptiveScheduler resource stabilisation should happen before the job is 
> cancelled
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-34318
>                 URL: https://issues.apache.org/jira/browse/FLINK-34318
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Gyula Fora
>            Priority: Major
>
> When a new resource requirement that raises the resource upper bound (max 
> TaskManagers) is submitted to the AdaptiveScheduler, the job is cancelled 
> as soon as the first TaskManager comes up.
> Once the job is cancelled the scheduler waits for the entire stabilisation 
> period to pass if it cannot acquire all resources before starting with the 
> lower-than-requested parallelism.
> The problem here is that waiting for resource stabilisation happens after the 
> job is cancelled, introducing unnecessary downtime for the job if the 
> stabilisation period is large.
> We should change the logic here so that the scheduler waits for the 
> stabilisation period first, acquiring all possible resources, and only then 
> cancels the job.
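
The ordering proposed in the description can be sketched as a small decision function. This is only an illustration under stated assumptions, not Flink's actual AdaptiveScheduler code: the class `StabiliseBeforeCancel`, the `Action` enum, and every parameter name are hypothetical. The idea is that the job keeps running while resources are still arriving, and is cancelled for a rescale only once either the full requested parallelism is available or the stabilisation period has elapsed.

```java
import java.time.Duration;

/** Hypothetical sketch of "stabilise first, cancel second"; not Flink code. */
final class StabiliseBeforeCancel {

    enum Action { KEEP_RUNNING, CANCEL_AND_RESCALE }

    /**
     * Decides whether a running job should be cancelled for a rescale.
     * All names are assumptions made for this illustration.
     */
    static Action decide(
            int currentParallelism,
            int availableSlots,
            int desiredParallelism,
            Duration sinceLastResourceChange,
            Duration stabilisationTimeout) {
        // Parallelism we could reach right now with the slots we have.
        int target = Math.min(availableSlots, desiredParallelism);
        if (target <= currentParallelism) {
            // Nothing to gain from a restart.
            return Action.KEEP_RUNNING;
        }
        if (availableSlots >= desiredParallelism) {
            // Everything requested has arrived: no reason to wait longer.
            return Action.CANCEL_AND_RESCALE;
        }
        if (sinceLastResourceChange.compareTo(stabilisationTimeout) >= 0) {
            // Resources stopped changing: accept lower-than-requested parallelism.
            return Action.CANCEL_AND_RESCALE;
        }
        // Still inside the stabilisation period: wait while the job keeps running.
        return Action.KEEP_RUNNING;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(60);
        // First new TaskManager just arrived, more are expected.
        System.out.println(decide(2, 3, 4, Duration.ofSeconds(5), timeout));
        // All requested slots are available.
        System.out.println(decide(2, 4, 4, Duration.ofSeconds(5), timeout));
        // Stabilisation period elapsed with only part of the resources.
        System.out.println(decide(2, 3, 4, Duration.ofSeconds(60), timeout));
    }
}
```

With this ordering the stabilisation wait overlaps with normal job execution, so the downtime is limited to the restart itself rather than the restart plus the stabilisation period.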



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
