swatiksi273-ksolves commented on code in PR #1139:
URL: 
https://github.com/apache/flink-kubernetes-operator/pull/1139#discussion_r3438211925


##########
flink-autoscaler/src/main/java/org/apache/flink/autoscaler/metrics/ScalingMetrics.java:
##########
@@ -83,6 +83,15 @@ public static void computeDataRateMetrics(
         var isSource = topology.isSource(jobVertexID);
         var ioMetrics = topology.get(jobVertexID).getIoMetrics();
 
+        if (!ioMetrics.isMetricsComplete()) {
+            LOG.warn(
+                    "Incomplete IO metrics for vertex {}, skipping scaling decision to avoid incorrect scale down.",
+                    jobVertexID);
+            scalingMetrics.put(ScalingMetric.NUM_RECORDS_IN, Double.NaN);
+            scalingMetrics.put(ScalingMetric.NUM_RECORDS_OUT, Double.NaN);

Review Comment:
   Hi @Dennis-Mircea, thanks for the feedback!
   
   I've updated the PR based on your suggestion. The fix is now in 
ScalingMetricCollector.getJobTopology() instead of ScalingMetrics.java, and NaN 
is no longer used anywhere. When any vertex reports read-records-complete: false 
or write-records-complete: false, we now throw NotReadyException directly, which 
causes the autoscaler to skip the entire collection cycle and retry at the next 
interval.
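   
   To illustrate the idea, here is a minimal, self-contained sketch. It is not 
the code in the PR: IncompleteIoMetricsSketch, VertexIoFlags, 
requireCompleteIoMetrics, and the local NotReadyException are hypothetical 
stand-ins, and the vertex IDs are made up. It only assumes that each vertex 
exposes the two complete flags mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; none of these types are the operator's real classes. */
class IncompleteIoMetricsSketch {

    /** Hypothetical stand-in for the autoscaler's NotReadyException. */
    static class NotReadyException extends RuntimeException {
        NotReadyException(String message) {
            super(message);
        }
    }

    /** Hypothetical holder for the two completeness flags of one vertex. */
    static final class VertexIoFlags {
        final boolean readRecordsComplete;
        final boolean writeRecordsComplete;

        VertexIoFlags(boolean readRecordsComplete, boolean writeRecordsComplete) {
            this.readRecordsComplete = readRecordsComplete;
            this.writeRecordsComplete = writeRecordsComplete;
        }

        boolean isComplete() {
            return readRecordsComplete && writeRecordsComplete;
        }
    }

    /** Throws as soon as one vertex is incomplete, before any metrics map is built. */
    static void requireCompleteIoMetrics(Map<String, VertexIoFlags> flagsByVertex) {
        for (Map.Entry<String, VertexIoFlags> entry : flagsByVertex.entrySet()) {
            if (!entry.getValue().isComplete()) {
                throw new NotReadyException(
                        "Incomplete IO metrics for vertex " + entry.getKey());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, VertexIoFlags> flags = new LinkedHashMap<>();
        flags.put("source", new VertexIoFlags(true, true));
        // Simulates write-records-complete: false on one vertex.
        flags.put("sink", new VertexIoFlags(true, false));
        try {
            requireCompleteIoMetrics(flags);
            System.out.println("metrics complete, continue collection");
        } catch (NotReadyException e) {
            // The caller skips the whole cycle and retries on the next interval.
            System.out.println("skip this cycle: " + e.getMessage());
        }
    }
}
```

   Failing the whole cycle, rather than writing placeholder values for a single 
vertex, keeps partial data out of the scaling decision entirely.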
   
   Changes in this update:
   1. Reverted all changes to IOMetrics.java and ScalingMetrics.java
   2. Fixed ScalingMetricCollector.java: it now checks the complete flags before 
building the metrics map
   3. Added testIncompleteIoMetricsThrowsNotReadyException test using the exact 
REST API payload reported by Trystan
   
   Only 2 files changed overall: ScalingMetricCollector.java (+12 lines of 
production code) and ScalingMetricCollectorTest.java (+1 test).
   
   Regarding cluster testing: I tested on minikube with the fix deployed. The 
complete: false window is very short in minikube since all pods run on the same 
node, but the root cause has been confirmed by the reporter (Trystan) on a real 
cluster: killing the JM and restarting it resolved the issue, with metrics 
returning complete: true after the restart.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
