[jira] [Updated] (FLINK-39925) Job throughput metrics incorrectly dropping to zero, forcing scale down

Trystan (Jira) Fri, 12 Jun 2026 11:58:09 -0700


     [ https://issues.apache.org/jira/browse/FLINK-39925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Trystan updated FLINK-39925:
----------------------------
    Description: 
Over the last few days I have noticed that the autoscaler somehow starts
collecting zeros for throughput metrics. The values drop to zero over the
course of about half an hour. As a result, the autoscaler keeps scaling down
even when it should not. The busy percentage is still very high, but the
operator no longer seems to take it into account.

We use the default operator config values almost everywhere (the operator's
built-in defaults, not the Helm chart defaults). `job.autoscaler.metrics.window`
is 30m for each job, which matches the time it takes for the values to reach
zero.
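
To illustrate why a 30m window would line up with a half-hour decay, here is a
minimal sketch. It assumes the reported value is a plain moving average over
the window, which may not be how the autoscaler actually aggregates metrics.
The one-sample-per-minute interval, the rate of 1000, and the helper name
`windowed_average` are all hypothetical.

```python
from collections import deque


def windowed_average(samples, window_size):
    """Return the moving average after each sample, over the last window_size samples."""
    window = deque(maxlen=window_size)  # oldest sample is evicted automatically
    averages = []
    for value in samples:
        window.append(value)
        averages.append(sum(window) / len(window))
    return averages


# Hypothetical setup: one sample per minute and a 30-sample (30m) window.
WINDOW_MINUTES = 30
healthy = [1000.0] * WINDOW_MINUTES  # assumed healthy throughput
broken = [0.0] * WINDOW_MINUTES      # collection starts returning zeros

averages = windowed_average(healthy + broken, WINDOW_MINUTES)

# Minutes after the first zero sample until the windowed average is exactly zero.
minutes_to_zero = next(
    i for i, avg in enumerate(averages[WINDOW_MINUTES:], start=1) if avg == 0.0
)
print(minutes_to_zero)
```

Under these assumptions the average declines gradually and only reaches zero
once every healthy sample has aged out of the window, that is, after one full
window length.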

Redeploying the job resets the metrics and the values are populated correctly.

We recently upgraded the operator from 1.9.0 to 1.14.0. We are running Flink 
1.18.1.

Around the same time, the logs show the computed output ratio for an edge
dropping to zero:
{code:java}
Computed output ratio for edge (a -> b) : 70.00000000033906
Computed output ratio for edge (a -> b) : 29.500000000536925
Computed output ratio for edge (a -> b) : 24.49999999973847
Computed output ratio for edge (a -> b) : 0.0
Computed output ratio for edge (a -> b) : 0.0
Computed output ratio for edge (a -> b) : 0.0
{code}
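
For triage, a small script along the following lines could locate the first
point where an edge's ratio falls to zero in a larger log dump. It is only a
sketch: the regular expression is written against the line format quoted above,
and the function name `first_drop_to_zero` is made up for this example, not
part of the operator.

```python
import re

# Matches the log format quoted above; a trailing double quote is tolerated.
RATIO_LINE = re.compile(
    r'Computed output ratio for edge \((\S+) -> (\S+)\) : ([0-9.eE+-]+)"?'
)


def first_drop_to_zero(lines):
    """Return (line_index, edge) for the first 0.0 ratio that follows a
    non-zero ratio on the same edge, or None if no such drop is found."""
    last_ratio = {}
    for index, line in enumerate(lines):
        match = RATIO_LINE.search(line)
        if match is None:
            continue  # unrelated log line
        edge = (match.group(1), match.group(2))
        ratio = float(match.group(3))
        if ratio == 0.0 and last_ratio.get(edge, 0.0) > 0.0:
            return index, edge
        last_ratio[edge] = ratio
    return None


sample = [
    "Computed output ratio for edge (a -> b) : 70.00000000033906",
    "Computed output ratio for edge (a -> b) : 29.500000000536925",
    "Computed output ratio for edge (a -> b) : 24.49999999973847",
    "Computed output ratio for edge (a -> b) : 0.0",
    "Computed output ratio for edge (a -> b) : 0.0",
    "Computed output ratio for edge (a -> b) : 0.0",
]
result = first_drop_to_zero(sample)
print(result)
```

Matching the timestamp of that line against the start of the metric decay
would show whether the two events coincide.
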
(In the Scaling Bounds screenshot, the yellow line is
*AutoScaler_jobVertexID_TRUE_PROCESSING_RATE_Average*, while the blue bounds
are *AutoScaler_jobVertexID_SCALE_UP_RATE_THRESHOLD_Current* and
*AutoScaler_jobVertexID_SCALE_DOWN_RATE_THRESHOLD_Current*.)

  was:
Over the last few days I have noticed that the autoscaler will start somehow 
collecting zeros for throughput metrics. The values drop to zero over the 
course of about half an hour. This causes the autoscaler to continue scaling 
down even when it should not. The busy percentage is still very high, but the 
operator seems to no longer be taking this into account.

We are using more or less all the default operator config values (not helm 
defaults, but actual operator defaults). `job.autoscaler.metrics.window` is 30m 
for each job, which matches the time when values finally drop to zero.

Redeploying the job resets the metrics and the values are populated correctly.

We recently upgraded the operator from 1.9.0 to 1.14.0. We are running Flink 
1.18.1.

 

(In the Scaling Bounds screenshot, the yellow line is
*AutoScaler_jobVertexID_TRUE_PROCESSING_RATE_Average*, while the blue bounds
are *AutoScaler_jobVertexID_SCALE_UP_RATE_THRESHOLD_Current* and
*AutoScaler_jobVertexID_SCALE_DOWN_RATE_THRESHOLD_Current*.)


> Job throughput metrics incorrectly dropping to zero, forcing scale down
> -----------------------------------------------------------------------
>
>                 Key: FLINK-39925
>                 URL: https://issues.apache.org/jira/browse/FLINK-39925
>             Project: Flink
>          Issue Type: Bug
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Trystan
>            Priority: Major
>         Attachments: Screenshot 2026-06-12 at 1.34.43 PM.png, Screenshot 
> 2026-06-12 at 1.46.26 PM.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
