trystanj commented on code in PR #586:
URL: https://github.com/apache/flink-kubernetes-operator/pull/586#discussion_r1585036601


##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/config/AutoScalerOptions.java:
##########
@@ -68,15 +68,16 @@ private static ConfigOptions.OptionBuilder autoScalerConfig(String key) {
     public static final ConfigOption<Double> TARGET_UTILIZATION_BOUNDARY =
             autoScalerConfig("target.utilization.boundary")
                     .doubleType()
-                    .defaultValue(0.1)
+                    .defaultValue(0.4)

Review Comment:
   Great, thank you!
   
   Yes, pendingRecords is present in all jobs and has a value (fluctuating, of course). I have never seen the operator's metric `flink_k8soperator_namespace_resource_AutoScaler_jobVertexID_LAG_Current` be anything other than NaN, though. I'm observing it via Prometheus, so maybe it's just a bug in the translation layer.
   
   Edit: actually, the same goes for `SOURCE_DATA_RATE` and `CURRENT_PROCESSING_RATE`...
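
   For reference, a quick way to sanity-check that the source vertex reports `pendingRecords` at all (independent of the operator) is to list the vertex metrics over Flink's REST API. A rough sketch; the REST address, job id, and vertex id below are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Rough check: does the source vertex report a pendingRecords metric at all? */
public class PendingRecordsCheck {

    public static void main(String[] args) throws Exception {
        // Placeholders: point these at your JobManager REST endpoint, job id, and source vertex id.
        String restBase = "http://localhost:8081";
        String jobId = "job-id-placeholder";
        String vertexId = "source-vertex-id-placeholder";

        // Listing vertex metrics without a ?get= parameter returns the available metric names,
        // so there is no need to guess the exact scoped name of pendingRecords.
        HttpRequest request =
                HttpRequest.newBuilder()
                        .uri(URI.create(restBase + "/jobs/" + jobId + "/vertices/" + vertexId + "/metrics"))
                        .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        // Crude string check instead of JSON parsing; enough to see whether any metric name
        // containing "pendingRecords" shows up for this vertex.
        if (response.body().contains("pendingRecords")) {
            System.out.println("pendingRecords is reported for this vertex");
        } else {
            System.out.println("no pendingRecords metric found for this vertex");
        }
    }
}
```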
   
   (Also, if this conversation is out of scope, I'd be happy to move it somewhere less tangential!)



##########
flink-kubernetes-operator-autoscaler/src/main/java/org/apache/flink/kubernetes/operator/autoscaler/config/AutoScalerOptions.java:
##########
@@ -68,15 +68,16 @@ private static ConfigOptions.OptionBuilder autoScalerConfig(String key) {
     public static final ConfigOption<Double> TARGET_UTILIZATION_BOUNDARY =
             autoScalerConfig("target.utilization.boundary")
                     .doubleType()
-                    .defaultValue(0.1)
+                    .defaultValue(0.4)

Review Comment:
   Thanks, that makes a lot of sense! Is catch-up status determined by literal timestamps compared against the catch-up duration? E.g. if a record was placed in Kafka 10m ago and our expected catch-up duration is 5m, are we 5m behind, or are we still 10m behind? Or is catch-up determined by throughput numbers? I'm just trying to get a better sense of the catch-up statistics!
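
   To check my own mental model, here's a toy throughput-based version of that calculation (made-up numbers, and not meant to mirror the operator's actual code):

```java
/** Toy numbers only: my reading of a throughput-based catch-up model, not the operator's code. */
public class CatchUpSketch {

    public static void main(String[] args) {
        double sourceDataRate = 1_000.0;   // records/s currently arriving at the source (assumed)
        double pendingRecords = 600_000.0; // current backlog reported by the source (assumed)
        double catchUpSeconds = 300.0;     // desired catch-up window of 5 minutes (assumed)

        // Extra throughput needed on top of the live rate to drain the backlog within the window.
        double catchUpRate = pendingRecords / catchUpSeconds; // 2,000 records/s

        // Total processing rate the job would need to sustain while catching up.
        double targetRate = sourceDataRate + catchUpRate;     // 3,000 records/s

        System.out.printf("catch-up rate: %.0f rec/s, target rate: %.0f rec/s%n", catchUpRate, targetRate);
    }
}
```
   If that framing is roughly right, then "how far behind" is a function of backlog size and achievable throughput rather than raw record timestamps, which would answer my question above.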
   
   Perhaps our problem is that lag and source_data_rate, for every single job tracked (operator 1.7, Flink 1.18.1, all using `KafkaSource`), are `NaN`, at least according to the exposed operator metrics themselves. If the operator can't see the lag, then maybe it can't make an informed decision? I'm wondering if this is a problem with our configuration, or maybe I'm just way off base. I should expect to see values for `LAG_Current` and `SOURCE_DATA_RATE_Current`, right?
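
   Separately, on the default change itself: my understanding is that an explicitly configured value still wins over the new 0.4 default, since the default only applies when the key is unset. Sketching that with Flink's `Configuration` API and the option constant from this file (0.25 is just an example value, not a recommendation):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.kubernetes.operator.autoscaler.config.AutoScalerOptions;

/** Sketch of default vs. explicit override for the target utilization boundary option. */
public class BoundaryOverrideSketch {

    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Nothing set yet: reading the option falls back to the default declared in
        // AutoScalerOptions (0.4 after this PR).
        System.out.println("default:  " + conf.get(AutoScalerOptions.TARGET_UTILIZATION_BOUNDARY));

        // Explicitly setting the option takes precedence over the default.
        conf.set(AutoScalerOptions.TARGET_UTILIZATION_BOUNDARY, 0.25);
        System.out.println("override: " + conf.get(AutoScalerOptions.TARGET_UTILIZATION_BOUNDARY));
    }
}
```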


