[jira] [Commented] (FLINK-39925) Job throughput metrics incorrectly dropping to zero, forcing scale down

Swati Gupta (Jira) Mon, 15 Jun 2026 23:59:10 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-39925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089302#comment-18089302
 ]


Swati Gupta commented on FLINK-39925:
-------------------------------------

Hi, I've been investigating this issue and I believe I've identified the root 
cause in {{ScalingMetricEvaluator.java}}. The {{computeEdgeOutputRatio}} 
method defaults {{outputRatio}} to {{0.0}} when input metrics are temporarily 
unavailable, instead of returning {{Double.NaN}}. This causes the 
autoscaler to treat the zero as valid throughput data, ultimately driving 
{{TARGET_DATA_RATE}} to zero and triggering an incorrect scale down even when 
the job is busy.
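
To illustrate the idea, here is a minimal, self-contained sketch. It is not the 
actual operator code or the proposed patch: the class {{OutputRatioSketch}}, 
both method signatures, and the helper {{averageIgnoringNaN}} are hypothetical 
and only assume that a missing metric arrives as {{Double.NaN}}. It shows why 
returning {{Double.NaN}} lets a downstream average skip a missing sample, 
whereas a {{0.0}} default would be averaged in as if it were real data.

```java
import java.util.Arrays;

// Hypothetical sketch; names and signatures are NOT taken from the Flink code base.
class OutputRatioSketch {

    // Assumption: a temporarily unavailable metric is passed in as Double.NaN.
    static double computeEdgeOutputRatio(double inputRate, double outputRate) {
        if (Double.isNaN(inputRate) || Double.isNaN(outputRate)) {
            // Signal "no data" rather than defaulting to 0.0, which a consumer
            // cannot tell apart from a genuine zero ratio.
            return Double.NaN;
        }
        return outputRate / inputRate;
    }

    // Hypothetical consumer: average the samples in a window, skipping NaN entries.
    // Returns NaN when the window holds no usable sample at all.
    static double averageIgnoringNaN(double[] samples) {
        return Arrays.stream(samples)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double[] window = {
            computeEdgeOutputRatio(10.0, 700.0),      // valid sample
            computeEdgeOutputRatio(Double.NaN, 0.0),  // input metric missing
        };
        System.out.println(averageIgnoringNaN(window));
    }
}
```

With a {{0.0}} default in place of the {{Double.NaN}} return, the second sample 
in {{main}} would halve the windowed average instead of being skipped.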

I have a fix ready and would like to contribute a patch for this. Could someone 
please assign this ticket to me? I'll raise a PR shortly.

Thanks!

> Job throughput metrics incorrectly dropping to zero, forcing scale down
> -----------------------------------------------------------------------
>
>                 Key: FLINK-39925
>                 URL: https://issues.apache.org/jira/browse/FLINK-39925
>             Project: Flink
>          Issue Type: Bug
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Trystan
>            Priority: Major
>         Attachments: Screenshot 2026-06-12 at 1.34.43 PM.png, Screenshot 
> 2026-06-12 at 1.46.26 PM.png
>
>
> Over the last few days I have noticed that the autoscaler will start somehow 
> collecting zeros for throughput metrics. The values drop to zero over the 
> course of about half an hour. This causes the autoscaler to continue scaling 
> down even when it should not. The busy percentage is still very high, but the 
> operator seems to no longer be taking this into account.
> We are using more or less all the default operator config values (not helm 
> defaults, but actual operator defaults). `job.autoscaler.metrics.window` is 
> 30m for each job, which matches the time when values finally drop to zero.
> Redeploying the job resets the metrics and the values are populated correctly.
> We recently upgraded the operator from 1.9.0 to 1.14.0. We are running Flink 
> 1.18.1.
> Around the same time, we see logs indicating the output ratio between edges 
> dropping to zero:
> {code:java}
> Computed output ratio for edge (a -> b) : 70.00000000033906"
> Computed output ratio for edge (a -> b) : 29.500000000536925"
> Computed output ratio for edge (a -> b) : 24.49999999973847"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0" {code}
> (in the Scaling Bounds screenshot, the yellow line is 
> *AutoScaler_jobVertexID_TRUE_PROCESSING_RATE_Average*, while the blue bounds 
> are *AutoScaler_jobVertexID_SCALE_UP_RATE_THRESHOLD_Current* and 
> *AutoScaler_jobVertexID_SCALE_DOWN_RATE_THRESHOLD_Current*)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
