[
https://issues.apache.org/jira/browse/FLINK-39925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089702#comment-18089702
]
Trystan commented on FLINK-39925:
---------------------------------
# It appears to be somewhat random. One job hit this issue after ~21d of
uptime, another after ~30d, and another after ~58d. Of course there have
been restarts in there, either due to autoscaler triggers or transient pod/node
failures.
# All jobs run on k8s in native application mode. Still digging into details,
but in fact there does appear to be _some kind_ of network issue when this
happens. New TMs are unable to spin up and join the cluster - I had assumed
these were separate issues when reporting this, but the more I find new info
the more I suspect they are related. Redeploying always fixes both issues. My
next step is to see whether simply killing the TM resolves it too, but I
haven't done that yet so I can preserve a known-bad state for investigation.
# The Flink UI shows {{loading...}}. I'm attaching a screenshot.
# In the payload above, the fact that the {{-complete}} flags are *false* is
interesting. For a healthy job it looks like these should be *true*. Based on
MutableIOMetrics.java, I'm inclined to believe the metrics are either
completely missing or cannot be fetched.
# !flink-ui-no-metrics.jpg!
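
To check other jobs for the same symptom, something like the following could flag vertices whose {{-complete}} flags are false. This is only a minimal sketch: it assumes the payload is the JSON returned by the job details REST endpoint, with a {{vertices}} array whose entries carry a {{metrics}} object. The function name and the sample payload are hypothetical and not taken from the affected job.

```python
import json


def incomplete_io_metrics(job_details):
    """Map vertex name -> list of '-complete' flags that are false."""
    result = {}
    for vertex in job_details.get("vertices", []):
        metrics = vertex.get("metrics", {})
        # Collect only the boolean completeness flags that are explicitly false.
        flags = sorted(
            key for key, value in metrics.items()
            if key.endswith("-complete") and value is False
        )
        if flags:
            result[vertex.get("name", vertex.get("id", "?"))] = flags
    return result


# Hypothetical, trimmed payload for illustration only.
sample = json.loads("""
{
  "vertices": [
    {"id": "a", "name": "source",
     "metrics": {"read-records": 0, "read-records-complete": false,
                 "write-records": 0, "write-records-complete": false}},
    {"id": "b", "name": "sink",
     "metrics": {"read-records": 120, "read-records-complete": true,
                 "write-records": 0, "write-records-complete": true}}
  ]
}
""")

print(incomplete_io_metrics(sample))
```

An empty result would suggest the IO metrics are being aggregated normally; any entry points at a vertex worth inspecting while the job is still in the bad state.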
> Job throughput metrics incorrectly dropping to zero, forcing scale down
> -----------------------------------------------------------------------
>
> Key: FLINK-39925
> URL: https://issues.apache.org/jira/browse/FLINK-39925
> Project: Flink
> Issue Type: Bug
> Components: Autoscaler, Kubernetes Operator
> Affects Versions: kubernetes-operator-1.14.0
> Reporter: Trystan
> Assignee: Swati Gupta
> Priority: Major
> Labels: pull-request-available
> Attachments: Screenshot 2026-06-12 at 1.34.43 PM.png, Screenshot
> 2026-06-12 at 1.46.26 PM.png, flink-ui-no-metrics.jpg
>
>
> Over the last few days I have noticed that the autoscaler will start somehow
> collecting zeros for throughput metrics. The values drop to zero over the
> course of about half an hour. This causes the autoscaler to continue scaling
> down even when it should not. The busy percentage is still very high, but the
> operator seems to no longer be taking this into account.
> We are using more or less all the default operator config values (not helm
> defaults, but actual operator defaults). `job.autoscaler.metrics.window` is
> 30m for each job, which matches the time when values finally drop to zero.
> Redeploying the job resets the metrics and the values are populated correctly.
> We recently upgraded the operator from 1.9.0 to 1.14.0. We are running Flink
> 1.18.1.
> Around the same time, we see logs indicating the output ratio between edges
> dropping to zero:
> {code:java}
> Computed output ratio for edge (a -> b) : 70.00000000033906"
> Computed output ratio for edge (a -> b) : 29.500000000536925"
> Computed output ratio for edge (a -> b) : 24.49999999973847"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0" {code}
> (in the Scaling Bounds screenshot, the yellow line is
> *AutoScaler_jobVertexID_TRUE_PROCESSING_RATE_Average*, while the
> blue bounds are *AutoScaler_jobVertexID_SCALE_UP_RATE_THRESHOLD_Current*
> and *AutoScaler_jobVertexID_SCALE_DOWN_RATE_THRESHOLD_Current*)
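
The half-hour decay in the quoted report would be consistent with a windowed average whose incoming samples have become zero. The sketch below illustrates that effect only. It assumes a plain moving average over one sample per minute with a 30-sample window; the operator's actual aggregation may differ, and the function name and sample values are hypothetical.

```python
from collections import deque


def windowed_averages(samples, window):
    """Yield the mean of the most recent `window` samples after each new sample."""
    recent = deque(maxlen=window)  # old samples fall out automatically
    for sample in samples:
        recent.append(sample)
        yield sum(recent) / len(recent)


# Hypothetical rate: healthy for 30 minutes, then every new sample is zero.
samples = [100.0] * 30 + [0.0] * 30
averages = list(windowed_averages(samples, window=30))

# Average at the last healthy sample, halfway through the decay, and at the end.
print(averages[29], averages[44], averages[59])  # 100.0 50.0 0.0
```

Under these assumptions the average only reaches zero once the whole window has been filled with zero samples, which would match the ~30m it takes for the values to bottom out.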
--
This message was sent by Atlassian Jira
(v8.20.10#820010)