Re: [PR] [FLINK-36535][autoscaler] Optimize the scale down logic based on historical parallelism to reduce the rescale frequency [flink-kubernetes-operator]

via GitHub Mon, 02 Dec 2024 01:18:24 -0800


1996fanrui commented on code in PR #920:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/920#discussion_r1865495287



##########
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/JobVertexScaler.java:
##########
@@ -289,21 +274,26 @@ private ParallelismChange applyScaleDownInterval(
         var scaleDownInterval = conf.get(SCALE_DOWN_INTERVAL);
         if (scaleDownInterval.toMillis() <= 0) {
             // The scale down interval is disable, so don't block scaling.
-            return ParallelismChange.required(newParallelism);
-        }
-
-        var firstTriggerTime = 
delayedScaleDown.getFirstTriggerTimeForVertex(vertex);
-        if (firstTriggerTime.isEmpty()) {
-            LOG.info("The scale down of {} is delayed by {}.", vertex, 
scaleDownInterval);
-            delayedScaleDown.updateTriggerTime(vertex, clock.instant());
-            return ParallelismChange.optional(newParallelism);
+            return ParallelismChange.build(newParallelism);
         }
 
-        if 
(clock.instant().isBefore(firstTriggerTime.get().plus(scaleDownInterval))) {
-            LOG.debug("Try to skip immediate scale down within scale-down 
interval for {}", vertex);
-            return ParallelismChange.optional(newParallelism);
+        var now = clock.instant();
+        var delayedScaleDownInfo = delayedScaleDown.triggerScaleDown(vertex, 
now, newParallelism);
+
+        // Never scale down within scale down interval
+        if 
(now.isBefore(delayedScaleDownInfo.getFirstTriggerTime().plus(scaleDownInterval)))
 {

Review Comment:
   Thanks @mxm  for the clarification!
   
   Actually, we expect to use the first scale down time instead of scale up 
time as the first trigger time. 
   
   - job.autoscaler.scale-up.grace-period hopes to use  scale up time
   - And job.autoscaler.scale-down.interval hopes to use the first scale down 
time, it `scale-down.interval` is 1 hour, we expect the scale down can be 
executed after 1 hour.
       - It's delayed scale down
       - And It could merge multiple scale down requests into one scale down 
execution.
   - If scale-down.interval uses the scale up time, when scale down request 
comes, job will execute scale down directly if job is scaled up 1 hour ago. (It 
cannot merge multiple scale down request into one execution)
   
   As I understand, the scale down interval wanna to reduce the scale down 
frequency.
   
   Please correct me if anything is wrong, thank you~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [FLINK-36535][autoscaler] Optimize the scale down logic based on historical parallelism to reduce the rescale frequency [flink-kubernetes-operator]

Reply via email to