[
https://issues.apache.org/jira/browse/FLINK-36018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-36018:
-----------------------------------
Labels: pull-request-available (was: )
> Support lazy scale down to avoid frequent rescaling
> ---------------------------------------------------
>
> Key: FLINK-36018
> URL: https://issues.apache.org/jira/browse/FLINK-36018
> Project: Flink
> Issue Type: Improvement
> Components: Autoscaler
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
> Labels: pull-request-available
>
> {*}{color:#de350b}Core idea{color}{*}: Make scaling up sensitive to prevent
> lags, and make scaling down insensitive to reduce restart frequency.
> h1. Background & Motivation
> We enabled the autoscaler for a few Flink production jobs. It works with the
> Adaptive Scheduler and the rescale API.
> Scaling results:
> * The recommended parallelism meets expectations most of the time
> * When the source traffic increases, the autoscaler scales up the job in
> time to prevent lags.
> * When the source traffic decreases, the autoscaler scales down the job in
> time to save resources.
> * {color:#de350b}*Pain point:*{color} Each job rescales more than 20 times a
> day (job.autoscaler.metrics.window=15 min by default).
> As we all know, the job is unavailable for a while during each restart, for
> several reasons:
> * Cancel job
> * Request resources (
> [FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
> is optimizing this)
> * Initialize task
> * Restore state
> * Catch up the lag accumulated during the restart
> * etc
> *{color:#de350b}Expectations:{color}*
> * Scaling up in time to prevent lags.
> * Scale down lazily to reduce restart downtime, while still ensuring resources
> are released eventually.
> h1. Solution:
> * Introduce job.autoscaler.scale-down.interval; the default value could be 1
> hour.
> * Replace job.autoscaler.scale-up.grace-period with
> job.autoscaler.scale-down.interval
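If adopted, the new option would sit alongside the existing autoscaler options. A hypothetical configuration sketch (the option name is the one proposed in this ticket, not yet a released Flink option; the deployment layout is illustrative):

```yaml
# Hypothetical FlinkDeployment snippet; job.autoscaler.scale-down.interval
# is the option proposed in this ticket, not an existing Flink option.
spec:
  flinkConfiguration:
    job.autoscaler.enabled: "true"
    job.autoscaler.metrics.window: "15 min"
    # Lazy scale-down window: scale-downs are held back for up to 1 hour.
    job.autoscaler.scale-down.interval: "1 h"
```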
> Detailed strategies:
> * Record the start time of the first scale-down event for each vertex
> separately. For example:
> ** vertex1: 2024-08-09 01:35:02
> ** vertex2: 2024-08-09 01:38:02
> * Scale-down will be triggered in the following cases:
> ** Any vertex needs scale up
> *** A job restart cannot be avoided anyway, so trigger the pending scale-down
> for other vertices as well if needed
> *** After the scale-down, clean up the recorded scale-down start times.
> ** The scale-down interval has elapsed for some vertex
> *** current time - min(scale-down start time across vertices) >
> scale-down.interval
> *** This means that there was no scale-up during the entire scale-down
> interval
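The two trigger conditions above can be sketched as a small decision function (a minimal Python illustration, assuming the per-vertex scale-down start times are tracked elsewhere; all names are hypothetical, not the actual Flink autoscaler API):

```python
from datetime import datetime, timedelta

def should_trigger_rescale(first_scale_down_times: dict,
                           any_vertex_scales_up: bool,
                           now: datetime,
                           scale_down_interval: timedelta) -> bool:
    """Hypothetical sketch of the rescale trigger described above."""
    # Case 1: some vertex must scale up, so a restart is unavoidable;
    # piggyback any pending scale-downs on the same restart.
    if any_vertex_scales_up:
        return True
    # Case 2: the interval has elapsed since the oldest pending scale-down,
    # i.e. no scale-up happened during the whole scale-down interval.
    if first_scale_down_times:
        return now - min(first_scale_down_times.values()) > scale_down_interval
    return False
```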
> Note1: If the recommended parallelism >= the current parallelism, the
> scale-down start time is cleared for that vertex.
> Note2: The recommended parallelism still comes from the latest 15 minutes of
> metrics. For example:
> * The current parallelism of vertex1 is 100, the traffic is decreased at
> night.
> * 2024-08-09 01:00:00, the recommended parallelism is 60.
> ** The start time of scale down is 2024-08-09 01:00:00.
> * 2024-08-09 01:15:00, the recommended parallelism is 50.
> ** Still within the scale-down interval.
> ** Don't update the scale-down start time.
> * 2024-08-09 01:31:00, the recommended parallelism is 40.
> ** Outside the scale-down interval, so trigger a rescale and use 40 as the
> recommended parallelism.
> ** The job.autoscaler.metrics.window is 15 min, so the recommendation is based
> on metrics from 2024-08-09 01:16:00 to 2024-08-09 01:31:00.
> Note3: If users set job.autoscaler.scale-down.interval <= 0, the job scales
> down immediately.
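The Note1/Note2 bookkeeping and the Note2 timeline can be replayed in a few lines (a Python sketch; a 30-minute interval is assumed here purely so the 01:31 trigger in the example works out, the proposed default being 1 hour):

```python
from datetime import datetime, timedelta

interval = timedelta(minutes=30)  # illustrative; the proposed default is 1 hour
first_scale_down = {}  # vertex -> first time a lower parallelism was recommended

def observe(vertex, current, recommended, now):
    if recommended >= current:
        first_scale_down.pop(vertex, None)        # Note1: clear pending scale-down
    else:
        first_scale_down.setdefault(vertex, now)  # Note2: keep the *first* start time

def interval_elapsed(now):
    return bool(first_scale_down) and \
        now - min(first_scale_down.values()) > interval

# 01:00 -> recommendation drops from 100 to 60: start the clock.
observe("vertex1", 100, 60, datetime(2024, 8, 9, 1, 0))
# 01:15 -> recommendation 50: the clock is NOT reset, and 15 min < 30 min.
observe("vertex1", 100, 50, datetime(2024, 8, 9, 1, 15))
assert not interval_elapsed(datetime(2024, 8, 9, 1, 15))
# 01:31 -> recommendation 40: 31 min > 30 min, so rescale to parallelism 40.
observe("vertex1", 100, 40, datetime(2024, 8, 9, 1, 31))
assert interval_elapsed(datetime(2024, 8, 9, 1, 31))
```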
--
This message was sent by Atlassian Jira
(v8.20.10#820010)