Re: [PR] [FLINK-39925][Autoscaler] Return NaN instead of 0 for edge output ratio when input metrics are unavailable [flink-kubernetes-operator]

via GitHub Wed, 17 Jun 2026 08:16:44 -0700


swatiksi273-ksolves commented on PR #1136:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1136#issuecomment-4732131294


   Update on root cause analysis:
   After reviewing the additional context provided by the reporter, I want to 
clarify the assumption behind this PR.
   When I initially analyzed the issue, I looked at the computeEdgeOutputRatio 
method in ScalingMetricEvaluator and noticed that it defaults outputRatio to 
0.0. I assumed that metrics were temporarily becoming unavailable (returning 
NaN) and that the 0.0 default was causing the incorrect scale-down.
   However, based on the reporter's latest comment, the Flink REST API is 
returning genuine zeros for all metrics (read-records, write-records, 
accumulated-busy-time) while the job is clearly busy. The zeros therefore do 
not come from a NaN fallback; they come directly from the REST API.
   This PR still improves the NaN handling in computeEdgeOutputRatio and is a 
valid defensive fix, but it may not fully resolve the reported issue.
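
   To illustrate why the NaN handling alone may not be enough, here is a 
minimal sketch of the distinction. It is not the actual computeEdgeOutputRatio 
code: the class name, the method name, its two-argument signature, and the 
choice to return 0.0 for a genuine zero input rate are all assumptions made 
for this example.

```java
/** Hypothetical sketch; not the real ScalingMetricEvaluator logic. */
public final class EdgeOutputRatioSketch {

    private EdgeOutputRatioSketch() {}

    /**
     * Returns the output/input ratio for an edge.
     * Assumption: NaN means "metric unavailable", so it is propagated
     * instead of being replaced by a 0.0 default.
     */
    public static double outputRatio(double inputRate, double outputRate) {
        if (Double.isNaN(inputRate) || Double.isNaN(outputRate)) {
            // Unavailable metrics: signal "unknown" rather than "no output".
            return Double.NaN;
        }
        if (inputRate == 0.0) {
            // Genuine zero input (assumed handling): nothing to divide by.
            return 0.0;
        }
        return outputRate / inputRate;
    }

    public static void main(String[] args) {
        System.out.println(outputRatio(Double.NaN, 50.0)); // NaN
        System.out.println(outputRatio(100.0, 50.0));      // 0.5
        System.out.println(outputRatio(0.0, 0.0));         // 0.0
    }
}
```

   The last call is the case the reporter describes: when the REST API itself 
reports zeros, a NaN guard never triggers, and the ratio still comes out as 0.0.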
   The actual root cause appears to be in the metric collection layer, 
specifically why the Flink REST API returns zeros while the job is running and 
busy. I am continuing to investigate this and will raise a follow-up PR if 
needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
