X-czh commented on code in PR #581:
URL: https://github.com/apache/flink-kubernetes-operator/pull/581#discussion_r1188100551


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/config/AutoScalerOptions.java:
##########
@@ -98,6 +98,13 @@ private static ConfigOptions.OptionBuilder autoScalerConfig(String key) {
                     .withDescription(
                             "Max scale down factor. 1 means no limit on scale down, 0.6 means job can only be scaled down with 60% of the original parallelism.");
 
+    public static final ConfigOption<Double> MAX_SCALE_UP_FACTOR =
+            autoScalerConfig("scale-up.max-factor")
+                    .doubleType()
+                    .defaultValue(2.0)

Review Comment:
   Hi @mxm, I'll expand a bit on (4). In case of a data center failure, we go through a disaster recovery process in which jobs are migrated to another data center to recover. During this process,
   
   - A huge backlog builds up, and when the jobs recover, they all tend to scale up while the overall resources are limited.
   - Many external services might not recover in time, or have to lower their capacity in favor of higher-priority services; this causes the Flink jobs relying on them to process more slowly and therefore to scale up (very likely by a lot) as well.
   
   As a result, almost all jobs tend to scale up a lot during this process, and we need to limit the single-step scale-up behavior so that a small set of jobs does not occupy too many resources before the automatic scaling circuit breaker or human intervention can take action. A sketch of such a cap follows below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
