X-czh commented on code in PR #581:
URL: https://github.com/apache/flink-kubernetes-operator/pull/581#discussion_r1188100551
##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/config/AutoScalerOptions.java:
##########
@@ -98,6 +98,13 @@ private static ConfigOptions.OptionBuilder autoScalerConfig(String key) {
                     .withDescription(
                             "Max scale down factor. 1 means no limit on scale down, 0.6 means job can only be scaled down with 60% of the original parallelism.");
+    public static final ConfigOption<Double> MAX_SCALE_UP_FACTOR =
+            autoScalerConfig("scale-up.max-factor")
+                    .doubleType()
+                    .defaultValue(2.0)
Review Comment:
Hi @mxm, I'll expand a bit on (4). In case of a data center failure, we go through a disaster recovery process in which jobs are migrated to another data center. During this process,
- A huge backlog builds up, so when these jobs recover they all tend to scale up, but the overall resource pool is limited.
- Many external services may not recover in time, or may have to lower their capacity in favor of higher-priority services. Flink jobs relying on them will process more slowly and therefore tend to scale up as well (very likely by a lot!).
As a result, almost all jobs will tend to scale up a lot during this process, and we need to limit single-step scale-up behavior so that a small set of jobs does not occupy too many resources before the automatic scaling circuit breaker or human intervention can take action.
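To make the intended effect concrete, here is a minimal sketch of how a max scale-up factor could clamp a proposed parallelism change in a single scaling step. The class, method, and variable names are illustrative only, not the operator's actual implementation:

```java
// Illustrative sketch only: bounds a proposed parallelism change by the
// scale-up.max-factor and scale-down.max-factor options discussed in this
// PR. All names here are hypothetical, not the operator's real code.
public final class ScaleFactorClamp {

    /**
     * Restricts the proposed parallelism to the range
     * [current * maxScaleDownFactor, current * maxScaleUpFactor].
     */
    public static int clamp(
            int currentParallelism,
            int proposedParallelism,
            double maxScaleUpFactor,
            double maxScaleDownFactor) {
        int upperBound = (int) Math.ceil(currentParallelism * maxScaleUpFactor);
        int lowerBound = (int) Math.floor(currentParallelism * maxScaleDownFactor);
        return Math.min(upperBound, Math.max(lowerBound, proposedParallelism));
    }

    public static void main(String[] args) {
        // With the proposed default of scale-up.max-factor = 2.0, a job at
        // parallelism 10 asking for 50 is limited to 20 in a single step.
        System.out.println(clamp(10, 50, 2.0, 0.6)); // prints 20
    }
}
```

With the default of 2.0 from the diff above, each scaling decision can at most double a job's parallelism, which bounds how much of the shared resource pool any one job can grab during a mass recovery.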