swatiksi273-ksolves commented on PR #1136: URL: https://github.com/apache/flink-kubernetes-operator/pull/1136#issuecomment-4732131294
Update on root cause analysis:

After reviewing the additional context provided by the reporter, I want to clarify the assumption behind this PR.

When I initially analyzed the issue, I looked at the computeEdgeOutputRatio method in ScalingMetricEvaluator and noticed that it defaults outputRatio to 0.0. I assumed that metrics were becoming temporarily unavailable (returning NaN) and that the default of 0.0 was causing the incorrect scale down.

However, based on the reporter's latest comment, the Flink REST API is actually returning genuine zeros for all metrics (read-records, write-records, accumulated-busy-time) while the job is clearly busy. So the zeros are not coming from a NaN fallback; they are coming directly from the REST API.

This PR still improves the NaN handling in computeEdgeOutputRatio and is a valid defensive fix, but it may not fully resolve the reported issue. The actual root cause appears to be in the metric collection layer, specifically why the Flink REST API returns zeros while the job is running and busy.

I am continuing to investigate this and will raise a follow-up PR if needed.
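
To illustrate the distinction above, here is a minimal sketch of a NaN guard and of why it cannot catch zeros that the REST API reports directly. This is not the operator's actual computeEdgeOutputRatio code: the class name, method signature, the choice of OptionalDouble, and the handling of a zero input rate are all assumptions made for illustration.

```java
import java.util.OptionalDouble;

/** Hypothetical sketch; not the real ScalingMetricEvaluator implementation. */
final class EdgeOutputRatioSketch {

    /**
     * Returns outputRate / inputRate for one edge, or empty when the ratio
     * cannot be computed. Both parameters are assumed to be records per second.
     */
    static OptionalDouble outputRatio(double inputRate, double outputRate) {
        // A missing metric (NaN) is reported as "unknown" rather than
        // silently becoming 0.0, which could look like an idle edge.
        if (Double.isNaN(inputRate) || Double.isNaN(outputRate)) {
            return OptionalDouble.empty();
        }
        // Assumption: with no input the ratio is undefined, so report it as unknown too.
        if (inputRate == 0.0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(outputRate / inputRate);
    }

    public static void main(String[] args) {
        // Missing metric: the guard applies and no ratio is produced.
        System.out.println(outputRatio(Double.NaN, 50.0).isPresent());
        // Genuine zero reported by the metrics source: the guard does not apply
        // and the ratio is a real 0.0.
        System.out.println(outputRatio(100.0, 0.0).getAsDouble());
    }
}
```

In this sketch a caller can skip a scaling decision when the result is empty, but a zero that arrives as an ordinary number still yields a ratio of 0.0. That is why the NaN handling is only a defensive fix if the zeros originate in the metric collection layer.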
