mxm commented on code in PR #586: URL: https://github.com/apache/flink-kubernetes-operator/pull/586#discussion_r1187270164
########## flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/config/AutoScalerOptions.java:
##########

```diff
@@ -68,15 +68,16 @@ private static ConfigOptions.OptionBuilder autoScalerConfig(String key)
     public static final ConfigOption<Double> TARGET_UTILIZATION_BOUNDARY =
             autoScalerConfig("target.utilization.boundary")
                     .doubleType()
-                    .defaultValue(0.1)
+                    .defaultValue(0.4)
```

Review Comment:
Not at all. The boundary is used to calculate a scale-down and a scale-up rate. If the processing capacity falls below the scale-up rate, we will scale up to reach the target capacity. If we exceed the scale-down rate, we will scale down to the target capacity. This is a bit counter-intuitive because the upper and lower bounds are actually reversed.

A `1.0` utilization for the upscale threshold will lower the scale-up rate, which means we delay upscaling in order to utilize 100% of our processing capacity based on the calculated rates. However, we will still scale up if our processing capacity is lower than the scale-up rate. The scale-up rate is always computed from the target rate, but the comparison is made against the actual processing capacity.

For example: let's say we currently have a processing capacity of 100 records/sec. The processing capacity is always estimated at 100% utilization (we also call this the *true rate*). At a target rate of 50 records/second (e.g. the Kafka ingestion rate), the scale-up bound will be 50 rec/s. That means we will only scale up once our processing capacity falls below 50 rec/s, so we delay scaling as much as possible. If the target rate were to increase to 110 rec/s, we would scale up because our processing capacity of 100 rec/s is now lower. Similarly, the downscale rate will be raised (instead of lowered) when we increase the utilization boundary. That means we won't scale down as quickly.
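To make the comparison concrete, here is a minimal sketch of the decision described above. This is not the actual `JobVertexScaler` code; the class, method names, and the exact way the bound is derived from the utilization threshold are assumptions for illustration only:

```java
/** Hypothetical sketch of the scaling decision described in the comment.
 *  Not the operator's actual implementation. */
public class UtilizationBoundsSketch {

    /** Scale up once the measured processing capacity (the "true rate" at
     *  100% utilization) falls below the target rate divided by the upscale
     *  utilization threshold. With a threshold of 1.0 the bound equals the
     *  target rate itself, i.e. upscaling is delayed as long as possible. */
    static boolean shouldScaleUp(
            double processingCapacity, double targetRate, double upscaleUtilization) {
        double scaleUpBound = targetRate / upscaleUtilization;
        return processingCapacity < scaleUpBound;
    }

    /** Symmetric sketch for the downscale side: scale down once the capacity
     *  exceeds the target rate divided by the downscale utilization threshold.
     *  A larger boundary lowers that threshold and thus raises the bound,
     *  making downscaling less eager. */
    static boolean shouldScaleDown(
            double processingCapacity, double targetRate, double downscaleUtilization) {
        double scaleDownBound = targetRate / downscaleUtilization;
        return processingCapacity > scaleDownBound;
    }

    public static void main(String[] args) {
        // Example from the comment: capacity 100 rec/s, target 50 rec/s,
        // upscale utilization threshold 1.0 -> scale-up bound is 50 rec/s.
        System.out.println(shouldScaleUp(100, 50, 1.0));  // false: 100 >= 50
        // If the target rate rises to 110 rec/s, the bound becomes 110 rec/s
        // and our 100 rec/s capacity now falls below it.
        System.out.println(shouldScaleUp(100, 110, 1.0)); // true: 100 < 110
    }
}
```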
The tradeoff here is slightly higher resource usage, but the scaling becomes less aggressive because we will only scale down once our processing capacity exceeds the now-increased "lower bound". To illustrate this further, here are some sketches:

Balanced:
```
------ upscale target rate
------ processing capacity (true rate)
------ downscale target rate
```

We will scale up:
```
------ downscale rate
------ upscale rate
------ processing capacity (true rate)
```

We will scale down:
```
------ processing capacity (true rate)
------ downscale rate
------ upscale rate
```
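One way to see why raising the default from `0.1` to `0.4` makes scaling less aggressive is to compute both bounds for each boundary value. The sketch below assumes the two utilization thresholds are derived as `targetUtilization ± boundary` (capped at 1.0) with a hypothetical target utilization of 0.7; the operator's actual derivation may differ, so treat the numbers as illustrative only:

```java
/** Hypothetical illustration of how a larger utilization boundary widens
 *  the "no scaling" band. Assumes thresholds = targetUtilization +/- boundary
 *  (an assumption, not necessarily the operator's formula). */
public class BoundaryWideningSketch {

    /** Returns {scaleUpBound, scaleDownBound} for the given parameters.
     *  Requires boundary < targetUtilization. */
    static double[] bounds(double targetRate, double targetUtilization, double boundary) {
        double upper = Math.min(1.0, targetUtilization + boundary);
        double lower = targetUtilization - boundary;
        // Scale up below targetRate / upper; scale down above targetRate / lower.
        return new double[] {targetRate / upper, targetRate / lower};
    }

    public static void main(String[] args) {
        double targetRate = 50;         // rec/s, e.g. Kafka ingestion rate
        double targetUtilization = 0.7; // hypothetical target utilization

        double[] narrow = bounds(targetRate, targetUtilization, 0.1); // old default
        double[] wide = bounds(targetRate, targetUtilization, 0.4);   // new default

        // The wider boundary pushes the scale-up bound down and the
        // scale-down bound up, so both actions trigger less often.
        System.out.printf("boundary 0.1: scale up < %.1f, scale down > %.1f%n",
                narrow[0], narrow[1]);
        System.out.printf("boundary 0.4: scale up < %.1f, scale down > %.1f%n",
                wide[0], wide[1]);
    }
}
```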