[
https://issues.apache.org/jira/browse/FLINK-39925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089693#comment-18089693
]
Swati Gupta commented on FLINK-39925:
-------------------------------------
Hi Trystan, thanks for the additional context and the screenshots!
Looking at your payload, I can see that {{accumulated-busy-time}},
{{read-records}} and {{write-records}} are all returning {{0}} from the Flink
REST API, even though the job is clearly busy. This confirms the issue is
happening at the metric collection layer, not in the autoscaler evaluator.
Could you help me understand a bit more:
# Does this happen after a specific uptime duration, or randomly?
# Are you running on Kubernetes? If yes, do you see any pod restarts or
network issues around the time metrics drop to zero?
# Does the Flink UI also show zeros for these metrics at the same time?
This will help narrow down whether the issue is in how the JM collects metrics
from TaskManagers, or how it serves them via the REST API.
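To make the third question easier to check repeatedly, here is a minimal sketch (not operator code) that flags running vertices whose three metrics are all zero in a job-details payload. It assumes the payload is the JSON returned by the REST API's {{/jobs/<job-id>}} endpoint, with a {{vertices}} list whose entries carry {{name}}, {{status}} and a {{metrics}} object. The function name {{zeroed_vertices}} is hypothetical.

```python
# Metric field names as they appear in the payload discussed above.
WATCHED = ("accumulated-busy-time", "read-records", "write-records")


def zeroed_vertices(job_details):
    """Return the names of RUNNING vertices whose watched metrics are all 0.

    `job_details` is the parsed JSON of a job-details response (assumed layout:
    {"vertices": [{"name": ..., "status": ..., "metrics": {...}}, ...]}).
    A watched metric that is absent from the payload is treated as 0 here.
    """
    flagged = []
    for vertex in job_details.get("vertices", []):
        # Vertices that are not running are expected to report no activity.
        if vertex.get("status") != "RUNNING":
            continue
        metrics = vertex.get("metrics", {})
        if all(metrics.get(name, 0) == 0 for name in WATCHED):
            flagged.append(vertex.get("name", vertex.get("id")))
    return flagged
```

One way to use it is to save the REST response to a file every few minutes, load it with {{json.load}}, and pass the result to {{zeroed_vertices}}. Comparing the timestamps of the first flagged vertex against what the Flink UI shows would indicate whether the two agree.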
Thanks!
> Job throughput metrics incorrectly dropping to zero, forcing scale down
> -----------------------------------------------------------------------
>
> Key: FLINK-39925
> URL: https://issues.apache.org/jira/browse/FLINK-39925
> Project: Flink
> Issue Type: Bug
> Components: Autoscaler, Kubernetes Operator
> Affects Versions: kubernetes-operator-1.14.0
> Reporter: Trystan
> Assignee: Swati Gupta
> Priority: Major
> Labels: pull-request-available
> Attachments: Screenshot 2026-06-12 at 1.34.43 PM.png, Screenshot
> 2026-06-12 at 1.46.26 PM.png
>
>
> Over the last few days I have noticed that the autoscaler somehow starts
> collecting zeros for throughput metrics. The values drop to zero over the
> course of about half an hour. This causes the autoscaler to continue scaling
> down even when it should not. The busy percentage is still very high, but the
> operator seems to no longer be taking this into account.
> We are using more or less all the default operator config values (not helm
> defaults, but actual operator defaults). `job.autoscaler.metrics.window` is
> 30m for each job, which matches the time when values finally drop to zero.
> Redeploying the job resets the metrics and the values are populated correctly.
> We recently upgraded the operator from 1.9.0 to 1.14.0. We are running Flink
> 1.18.1.
> Around the same time, we see logs indicating the output ratio between edges
> dropping to zero:
> {code:java}
> Computed output ratio for edge (a -> b) : 70.00000000033906"
> Computed output ratio for edge (a -> b) : 29.500000000536925"
> Computed output ratio for edge (a -> b) : 24.49999999973847"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0" {code}
> (in the Scaling Bounds screenshot, the yellow line is
> *{{AutoScaler_jobVertexID_TRUE_PROCESSING_RATE_Average}}*, while the
> blue bounds are *AutoScaler_jobVertexID_SCALE_UP_RATE_THRESHOLD_Current*
> and *AutoScaler_jobVertexID_SCALE_DOWN_RATE_THRESHOLD_Current*)