Rui Fan created FLINK-36018:
-------------------------------
Summary: Support lazy scale down to avoid frequent rescaling
Key: FLINK-36018
URL: https://issues.apache.org/jira/browse/FLINK-36018
Project: Flink
Issue Type: Improvement
Components: Autoscaler
Reporter: Rui Fan
Assignee: Rui Fan
h1. Background & Motivation
We enabled the autoscaler for a few Flink production jobs. It works with the
Adaptive Scheduler and the Rescale API.
Scaling results:
* The recommended parallelism meets expectations most of the time
* When the source traffic increases, the autoscaler scales up the job in time
to prevent lags.
* When the source traffic decreases, the autoscaler scales down the job in time
to save resources.
* {color:#de350b}*Pain point:*{color} Each job rescales more than 20 times a
day (job.autoscaler.metrics.window=15 min by default).
The job is unavailable for a while during each restart, for several reasons:
* Cancel job
* Request resources
([FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
is optimizing it)
* Initialize task
* Restore state
* Catch up lag during restart
* etc
*{color:#de350b}Expectations:{color}*
* Scaling up in time to prevent lags.
* Lazy scaling down to reduce downtime and ensure resources can be released
later.
h1. Solution:
Introduce job.autoscaler.scale-down.lazy-period; the default value could be 30
min.
Detailed strategies:
* Record the start time of the first scale-down event for each vertex
separately. For example:
** vertex1: 2024-08-09 01:35:02
** vertex2: 2024-08-09 01:38:02
* Scale-down is triggered in either of the following cases:
** Any vertex needs to scale up
*** A job restart cannot be avoided, so scale down the other vertices as well
if needed
*** After the scale-down, clean up the recorded scale-down start times.
** The scale-down lazy period of any vertex has elapsed
*** current time - min(start time for each vertex) > scale-down.lazy-period
*** This means that no scale-up happened during the scale-down lazy period
Note1: If the recommended parallelism >= current parallelism, the start time of
scale-down is cleaned up for the current vertex.
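The strategy above can be sketched roughly as follows. This is only an illustration in Python, not the autoscaler's actual code: the function name decide, its parameters, and the dict-based state are hypothetical, and the 30-minute value is the default suggested in this issue.

```python
from datetime import datetime, timedelta

LAZY_PERIOD = timedelta(minutes=30)  # default value suggested in this issue


def decide(now, current, recommended, scale_down_start, lazy_period=LAZY_PERIOD):
    """Return {vertex: new parallelism} to apply now; {} means no rescale.

    current / recommended: dict of vertex -> parallelism (hypothetical shape).
    scale_down_start: dict of vertex -> time of its first pending scale-down;
    it is updated in place.
    """
    for vertex, rec in recommended.items():
        if rec < current[vertex]:
            # Record only the first scale-down event; later ones keep that time.
            scale_down_start.setdefault(vertex, now)
        else:
            # Note1: recommended >= current, so forget the pending scale-down.
            scale_down_start.pop(vertex, None)

    # Case 1: any vertex needs to scale up, so a restart cannot be avoided.
    scale_up = any(rec > current[v] for v, rec in recommended.items())
    # Case 2: the lazy period of the oldest pending scale-down has elapsed.
    expired = bool(scale_down_start) and (
        now - min(scale_down_start.values()) > lazy_period
    )

    if not (scale_up or expired):
        return {}

    # The job restarts anyway: apply every pending change, then reset the state.
    changes = {v: rec for v, rec in recommended.items() if rec != current[v]}
    scale_down_start.clear()
    return changes
```

With this shape, a scale-up on one vertex also flushes the pending scale-downs of the other vertices, so the job restarts once instead of twice.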
Note2: The recommended parallelism still comes from the latest 15-minute
metrics. For example:
* The current parallelism of vertex1 is 100, and the traffic decreases at night.
* 2024-08-09 01:00:00, the recommended parallelism is 60.
** The start time of scale down is 2024-08-09 01:00:00.
* 2024-08-09 01:15:00, the recommended parallelism is 50.
** Still within the range of scale down lazy period.
** Don't update the start time of scale down.
* 2024-08-09 01:31:00, the recommended parallelism is 40.
** Outside of scale-down.lazy-period, trigger rescale, and use 40 as the
recommended parallelism.
** The job.autoscaler.metrics.window is 15 min, so the recommendation is based
on metrics from 2024-08-09 01:16:00 to 2024-08-09 01:31:00.
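The timeline above can be replayed with a small single-vertex simulation. It is a simplified sketch: the variable names are made up for illustration, the scale-up trigger is left out because this example only contains scale-down recommendations, and the recommendations are hard-coded rather than computed from metrics.

```python
from datetime import datetime, timedelta

lazy_period = timedelta(minutes=30)  # proposed job.autoscaler.scale-down.lazy-period
current = 100                        # current parallelism of vertex1
start = None                         # start time of the pending scale-down
events = []                          # (time, old parallelism, new parallelism)

# (evaluation time, recommended parallelism) taken from the example above
timeline = [
    (datetime(2024, 8, 9, 1, 0, 0), 60),
    (datetime(2024, 8, 9, 1, 15, 0), 50),
    (datetime(2024, 8, 9, 1, 31, 0), 40),
]

for now, recommended in timeline:
    if recommended >= current:
        start = None      # Note1: no scale-down pending any more
        continue
    if start is None:
        start = now       # first scale-down event; never updated afterwards
    if now - start > lazy_period:
        # Lazy period elapsed: rescale to the latest recommendation.
        events.append((now, current, recommended))
        current, start = recommended, None

for when, old, new in events:
    print(f"{when}: rescale {old} -> {new}")
```

Running it prints a single rescale at 2024-08-09 01:31:00 from 100 to 40; the intermediate recommendations of 60 and 50 never cause a restart.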