Trystan created FLINK-39925:
-------------------------------

             Summary: Job throughput metrics incorrectly dropping to zero, 
forcing scale down
                 Key: FLINK-39925
                 URL: https://issues.apache.org/jira/browse/FLINK-39925
             Project: Flink
          Issue Type: Bug
          Components: Autoscaler
    Affects Versions: 1.14.0
            Reporter: Trystan
         Attachments: Screenshot 2026-06-12 at 1.31.34 PM.png, Screenshot 
2026-06-12 at 1.34.43 PM.png

Over the last few days I have noticed that the autoscaler starts collecting 
zeros for its throughput metrics. The values drop to zero over the course of 
about half an hour. This causes the autoscaler to keep scaling down even when 
it should not. The busy percentage is still very high, but the operator no 
longer seems to take it into account.

We are using more or less all the default operator config values (not the Helm 
chart defaults, but the actual operator defaults). 
`job.autoscaler.metrics.window` is 30m for each job, which matches the time it 
takes for the values to reach zero.
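
To illustrate why a 30m window would line up with a roughly 30-minute decay, 
here is a minimal Python sketch. It assumes one sample per minute and a plain 
mean over the window; the autoscaler's actual aggregation may differ. The 
function name `windowed_averages` and the sample values are hypothetical and 
not taken from the operator.

```python
from collections import deque


def windowed_averages(samples, window_size):
    """Return the mean of the last `window_size` samples after each new sample."""
    window = deque(maxlen=window_size)  # oldest sample is evicted automatically
    averages = []
    for value in samples:
        window.append(value)
        averages.append(sum(window) / len(window))
    return averages


# Hypothetical data: 30 healthy one-minute samples, then 30 zero samples.
WINDOW_MINUTES = 30  # mirrors job.autoscaler.metrics.window = 30m
samples = [1000.0] * 30 + [0.0] * 30
averages = windowed_averages(samples, WINDOW_MINUTES)

# Print the windowed mean every 5 minutes after the zeros begin.
for minute in range(29, 60, 5):
    print(f"minute {minute + 1:2d}: {averages[minute]:8.2f} records/s")
```

Under these assumptions the windowed value falls gradually and only reaches 
zero once every healthy sample has aged out of the window, which is consistent 
with the half-hour decay described above.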

Redeploying the job resets the metrics and the values are populated correctly.

We recently upgraded the operator from 1.9.0 to 1.14.0. We are running Flink 
1.18.1.



