Trystan created FLINK-39925:
-------------------------------
Summary: Job throughput metrics incorrectly dropping to zero,
forcing scale down
Key: FLINK-39925
URL: https://issues.apache.org/jira/browse/FLINK-39925
Project: Flink
Issue Type: Bug
Components: Autoscaler
Affects Versions: 1.14.0
Reporter: Trystan
Attachments: Screenshot 2026-06-12 at 1.31.34 PM.png, Screenshot
2026-06-12 at 1.34.43 PM.png
Over the last few days I have noticed that the autoscaler starts collecting
zeros for the throughput metrics. The values drop to zero over the course of
about half an hour. This causes the autoscaler to keep scaling down even when
it should not. The busy percentage is still very high, but the operator no
longer seems to take it into account.
We are using more or less all of the default operator config values (not the
helm defaults, but the actual operator defaults). `job.autoscaler.metrics.window`
is 30m for each job, which matches the time it takes for the values to reach zero.
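
To illustrate why a 30m window would line up with a half-hour decay, here is a minimal sketch of a sliding-window average. It is not the autoscaler's actual implementation: the function `windowed_average`, the one-sample-per-minute cadence, and the 1000 records/s figure are all assumptions made for the example. It only shows that if every newly collected sample is zero, a windowed average falls gradually and reaches exactly zero once the window holds nothing but zeros.

```python
from collections import deque


def windowed_average(samples, window_size):
    """Yield the mean of the most recent `window_size` samples after each new sample."""
    window = deque(maxlen=window_size)  # old samples fall out automatically
    for value in samples:
        window.append(value)
        yield sum(window) / len(window)


# Hypothetical data: one sample per minute. Throughput is a steady
# 1000 records/s for 30 minutes, then every new sample is collected as zero.
WINDOW_MINUTES = 30
samples = [1000.0] * 30 + [0.0] * 30

averages = list(windowed_average(samples, WINDOW_MINUTES))

# Print the average every five minutes after the zeros begin.
for minutes_after in range(0, 31, 5):
    index = 29 + minutes_after
    print(f"{minutes_after:2d} min after zeros begin: {averages[index]:7.1f}")
```

Under these assumptions the average is halved after 15 minutes and hits zero after 30, which would be consistent with the observed behavior if the collected samples themselves became zero at some point, rather than the aggregation being wrong.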
Redeploying the job resets the metrics and the values are populated correctly.
We recently upgraded the operator from 1.9.0 to 1.14.0. We are running Flink
1.18.1.