[
https://issues.apache.org/jira/browse/FLINK-36535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rui Fan resolved FLINK-36535.
-----------------------------
Fix Version/s: kubernetes-operator-1.11.0
Resolution: Fixed
Merged to main (1.11.0) via: d9e8cce85499f26ac0129a2f2d13a083d68b5c21
> Optimize the scale down logic based on historical parallelism
> -------------------------------------------------------------
>
> Key: FLINK-36535
> URL: https://issues.apache.org/jira/browse/FLINK-36535
> Project: Flink
> Issue Type: Improvement
> Components: Autoscaler
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.11.0
>
>
> This is a follow-up to FLINK-36018, which introduced lazy scale down to
> avoid frequent rescaling.
> h1. Proposed Change
> Treat scale-down.interval as a window:
> * Record the scale down trigger time when the recommended parallelism <
> current parallelism.
> ** When the recommended parallelism >= current parallelism, cancel the
> triggered scale down.
> * Execute the scale down when currentTime - triggerTime >
> scale-down.interval.
> ** {color:#de350b}Change1{color}: Use the maximum parallelism within the
> window instead of the latest parallelism when scaling down.
> * {color:#de350b}Change2{color}: Never scale down when currentTime -
> triggerTime < scale-down.interval.
> ** In FLINK-36018, a scale down could still be executed even when
> currentTime - triggerTime < scale-down.interval.
> ** For example, taskA could be scaled down whenever taskB needed to scale up.
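>
> A minimal sketch of this window logic (the class and method names are
> hypothetical, not the operator's actual implementation):
> {code:java}
> import java.time.Duration;
> import java.time.Instant;
>
> public class DelayedScaleDownWindow {
>
>     private final Duration scaleDownInterval;
>
>     private Instant triggerTime;           // set when a scale down is first triggered
>     private int maxRecommendedParallelism; // Change1: peak recommendation within the window
>
>     public DelayedScaleDownWindow(Duration scaleDownInterval) {
>         this.scaleDownInterval = scaleDownInterval;
>     }
>
>     // Returns the parallelism to apply for this evaluation.
>     public int onEvaluation(Instant now, int currentParallelism, int recommendedParallelism) {
>         if (recommendedParallelism >= currentParallelism) {
>             // Cancel any triggered scale down; scale ups are not delayed.
>             triggerTime = null;
>             return recommendedParallelism;
>         }
>         if (triggerTime == null) {
>             // First recommendation below the current parallelism: open the window.
>             triggerTime = now;
>             maxRecommendedParallelism = recommendedParallelism;
>             return currentParallelism;
>         }
>         // Change1: track the maximum recommended parallelism within the window.
>         maxRecommendedParallelism = Math.max(maxRecommendedParallelism, recommendedParallelism);
>         if (Duration.between(triggerTime, now).compareTo(scaleDownInterval) < 0) {
>             // Change2: never scale down before the full interval has elapsed.
>             return currentParallelism;
>         }
>         // Window elapsed: scale down to the window's peak and reset.
>         triggerTime = null;
>         return maxRecommendedParallelism;
>     }
> }
> {code}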
> h1. Background
> Some critical Flink jobs need to scale up promptly, but should only scale
> down on a daily basis. In other words, Flink users do not want a job to be
> scaled down multiple times within 24 hours; the job should keep running at
> the parallelism of its daily peak hours.
> Note: users want a scale down to happen only when even the peak-hour
> parallelism wastes resources. This is a trade-off between downtime and
> resource waste for a critical job.
> h1. Current solution
> In general, this requirement can be met by setting
> {color:#de350b}job.autoscaler.scale-down.interval = 24 hours{color}. Suppose
> taskA runs with parallelism 100 and the recommended parallelism is 100 during
> the peak hours of each day. taskA should then never be rescaled, because the
> triggered scale down is canceled once the recommended parallelism >= current
> parallelism within the 24 hours (this is exactly what FLINK-36018 does).
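>
> As a hedged walk-through with the sketch above (the off-peak recommendation
> of 80 is invented for illustration; only the peak value 100 comes from this
> description):
> {code:java}
> import java.time.Duration;
> import java.time.Instant;
>
> public class TaskAExample {
>     public static void main(String[] args) {
>         // 24-hour window, matching job.autoscaler.scale-down.interval = 24 hours.
>         DelayedScaleDownWindow taskA = new DelayedScaleDownWindow(Duration.ofHours(24));
>
>         // Off-peak: recommended 80 < current 100, a scale down is triggered but not executed.
>         int offPeak = taskA.onEvaluation(Instant.parse("2024-01-01T03:00:00Z"), 100, 80);  // 100
>
>         // Peak hours: recommended 100 >= current 100, the pending scale down is canceled.
>         int peak = taskA.onEvaluation(Instant.parse("2024-01-01T12:00:00Z"), 100, 100);    // 100
>     }
> }
> {code}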
> h1. Unexpected Scenario & How to Solve It
> However, I found that the critical production job is still rescaled about 10
> times every day, even with scale-down.interval set to 24 hours.
> Root cause: a job may have many sources, and the traffic peaks of these
> sources may occur at different times. When taskA triggers a scale down, that
> scale down is not actively executed within 24 hours, but it may still be
> executed as a side effect when other tasks are scaled up.
> For example:
> * The scale down of sourceB and sourceC may be executed when sourceA scales
> up.
> * After a while, the scale down of sourceA and sourceC may be executed when
> sourceB scales up.
> * After a while, the scale down of sourceA and sourceB may be executed when
> sourceC scales up.
> * When there are many tasks, these three steps repeat over and over.
> That is why the job is rescaled about 10 times every day.
> {color:#de350b}Change2{color} of the proposed change solves this issue: never
> scale down when currentTime - triggerTime < scale-down.interval.
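>
> A sketch of how the Change2 guard could look at rescale time (all names are
> illustrative): when another vertex forces a rescale, a vertex with a pending
> scale down keeps its current parallelism unless its own window has fully
> elapsed.
> {code:java}
> import java.time.Duration;
> import java.time.Instant;
>
> public class ScaleDownGuard {
>
>     public static int parallelismAtRescale(
>             Instant now,
>             int currentParallelism,
>             Instant scaleDownTriggerTime,  // null if no scale down is pending
>             int maxRecommendedInWindow,
>             Duration scaleDownInterval) {
>         boolean windowElapsed =
>                 scaleDownTriggerTime != null
>                         && Duration.between(scaleDownTriggerTime, now)
>                                 .compareTo(scaleDownInterval) >= 0;
>         // Before Change2, a pending scale down could piggyback on any rescale,
>         // which is what caused the ~10 rescales per day described above.
>         return windowElapsed ? maxRecommendedInWindow : currentParallelism;
>     }
> }
> {code}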
>
> {color:#de350b}Change1{color}: Use the maximum parallelism within the
> window instead of the latest parallelism when scaling down.
> * This ensures that the parallelism after scaling down matches yesterday's
> peak parallelism.
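>
> Continuing the hypothetical taskA walk-through from above (all numbers are
> invented), Change1 applies the window's peak rather than the latest value:
> {code:java}
> // Within the window the recommendation fluctuates: 90, then 70.
> taskA.onEvaluation(Instant.parse("2024-01-01T15:00:00Z"), 100, 90); // pending, returns 100
> // Once the 24-hour window elapses, the peak of the window (90) is applied,
> // not the latest recommendation (70), i.e. yesterday's peak parallelism.
> taskA.onEvaluation(Instant.parse("2024-01-02T16:00:00Z"), 100, 70); // returns 90
> {code}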