[ 
https://issues.apache.org/jira/browse/FLINK-39925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089683#comment-18089683
 ] 

Trystan commented on FLINK-39925:
---------------------------------

Thanks for taking a look at this, Swati. In case this helps, I have a bit more 
context/discovery to add to this.

 

On a job currently stuck in this state, the Flink UI itself also returns 
zeros for metrics on every refresh. It shows `loading...` and everything else 
is empty. I'm including the whole (somewhat anonymized) payload below in case 
it is helpful. So I think the problem may be two-fold. First, the JM is unable 
to collect these metrics, and I'm not sure why yet; I will do some more 
experimentation today. Prometheus is able to scrape and show the metrics 
without issue, so there's something I don't yet understand about how the JM 
accesses or serves them.

Second, the autoscaler is seeing _apparently_ valid zeros. The metrics are not 
reported as unavailable; they are reported as plain zeros.

 

 
{code:java}
GET flinkui/jobs/<job_id>

{
    "jid": "<job_id>",
    "name": "<job_name>",
    "isStoppable": false,
    "state": "RUNNING",
    "start-time": 1776097725815,
    "end-time": -1,
    "duration": 5609170974,
    "maxParallelism": -1,
    "now": 1781706896789,
    "timestamps": {
        "RUNNING": 1781653433979,
        "SUSPENDED": 0,
        "CANCELLING": 0,
        "CANCELED": 0,
        "RESTARTING": 0,
        "RECONCILING": 0,
        "INITIALIZING": 1776097725815,
        "FAILED": 0,
        "FAILING": 0,
        "FINISHED": 0,
        "CREATED": 1781653433967
    },
    "vertices": [
        {
            "id": "<vertex_a>",
            "name": "<vertex_a_name>",
            "maxParallelism": 128,
            "parallelism": 18,
            "status": "RUNNING",
            "start-time": 1781653433981,
            "end-time": -1,
            "duration": 53462808,
            "tasks": {
                "SCHEDULED": 0,
                "FAILED": 0,
                "RUNNING": 18,
                "CANCELED": 0,
                "DEPLOYING": 0,
                "FINISHED": 0,
                "CANCELING": 0,
                "RECONCILING": 0,
                "CREATED": 0,
                "INITIALIZING": 0
            },
            "metrics": {
                "read-bytes": 0,
                "read-bytes-complete": false,
                "write-bytes": 0,
                "write-bytes-complete": false,
                "read-records": 0,
                "read-records-complete": false,
                "write-records": 0,
                "write-records-complete": false,
                "accumulated-backpressured-time": 0,
                "accumulated-idle-time": 0,
                "accumulated-busy-time": 0.0
            }
        },
        {
            "id": "<vertex_b>",
            "name": "<vertex_b_name>",
            "maxParallelism": 128,
            "parallelism": 20,
            "status": "RUNNING",
            "start-time": 1781653433983,
            "end-time": -1,
            "duration": 53462806,
            "tasks": {
                "SCHEDULED": 0,
                "FAILED": 0,
                "RUNNING": 20,
                "CANCELED": 0,
                "DEPLOYING": 0,
                "FINISHED": 0,
                "CANCELING": 0,
                "RECONCILING": 0,
                "CREATED": 0,
                "INITIALIZING": 0
            },
            "metrics": {
                "read-bytes": 0,
                "read-bytes-complete": false,
                "write-bytes": 0,
                "write-bytes-complete": false,
                "read-records": 0,
                "read-records-complete": false,
                "write-records": 0,
                "write-records-complete": false,
                "accumulated-backpressured-time": 0,
                "accumulated-idle-time": 0,
                "accumulated-busy-time": 0.0
            }
        }
    ],
    "status-counts": {
        "SCHEDULED": 0,
        "FAILED": 0,
        "RUNNING": 2,
        "CANCELED": 0,
        "DEPLOYING": 0,
        "FINISHED": 0,
        "CANCELING": 0,
        "RECONCILING": 0,
        "CREATED": 0,
        "INITIALIZING": 0
    },
    "plan": {
        "jid": "<job_id>",
        "name": "<job_name>",
        "type": "STREAMING",
        "nodes": [
            {
                "id": "<vertex_a_id>",
                "parallelism": 18,
                "operator": "",
                "operator_strategy": "",
                "description": "<description>",
                "optimizer_properties": {}
            },
            {
                "id": "<vertex_b_id>",
                "parallelism": 20,
                "operator": "",
                "operator_strategy": "",
                "description": "<description>",
                "inputs": [
                    {
                        "num": 0,
                        "id": "<vertex_a_id>",
                        "ship_strategy": "HASH",
                        "exchange": "pipelined_bounded"
                    }
                ],
                "optimizer_properties": {}
            }
        ]
    }
}{code}
 

 

> Job throughput metrics incorrectly dropping to zero, forcing scale down
> -----------------------------------------------------------------------
>
>                 Key: FLINK-39925
>                 URL: https://issues.apache.org/jira/browse/FLINK-39925
>             Project: Flink
>          Issue Type: Bug
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Trystan
>            Assignee: Swati Gupta
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screenshot 2026-06-12 at 1.34.43 PM.png, Screenshot 
> 2026-06-12 at 1.46.26 PM.png
>
>
> Over the last few days I have noticed that the autoscaler will start somehow 
> collecting zeros for throughput metrics. The values drop to zero over the 
> course of about half an hour. This causes the autoscaler to continue scaling 
> down even when it should not. The busy percentage is still very high, but the 
> operator seems to no longer be taking this into account.
> We are using more or less all the default operator config values (not helm 
> defaults, but actual operator defaults). `job.autoscaler.metrics.window` is 
> 30m for each job, which matches the time when values finally drop to zero.
> Redeploying the job resets the metrics and the values are populated correctly.
> We recently upgraded the operator from 1.9.0 to 1.14.0. We are running Flink 
> 1.18.1.
> Around the same time, we see logs indicating the output ratio between edges 
> dropping to zero:
> {code:java}
> Computed output ratio for edge (a -> b) : 70.00000000033906"
> Computed output ratio for edge (a -> b) : 29.500000000536925"
> Computed output ratio for edge (a -> b) : 24.49999999973847"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0" {code}
> (in the Scaling Bounds screenshot, the yellow line is 
> *AutoScaler_jobVertexID_TRUE_PROCESSING_RATE_Average*, while the blue bounds 
> are *AutoScaler_jobVertexID_SCALE_UP_RATE_THRESHOLD_Current* and 
> *AutoScaler_jobVertexID_SCALE_DOWN_RATE_THRESHOLD_Current*)
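
The issue description notes that the values reach zero after roughly the 
30m `job.autoscaler.metrics.window`. A simplified illustration of why a 
windowed average would behave that way once zeros start arriving is sketched 
below. It assumes one sample per minute and a plain mean over the last 30 
samples; this is not the operator's actual aggregation code, and 
`windowed_average` is a hypothetical helper.

```python
from collections import deque


def windowed_average(samples, window_size):
    """Yield the mean of the most recent `window_size` samples after each sample."""
    window = deque(maxlen=window_size)
    for sample in samples:
        window.append(sample)
        yield sum(window) / len(window)


# 30 one-minute samples at a steady (made-up) rate, then 30 minutes of zeros.
samples = [100.0] * 30 + [0.0] * 30
averages = list(windowed_average(samples, window_size=30))

# Average just before the zeros start, 15 minutes in, and after a full window.
print(averages[29], averages[44], averages[59])  # prints: 100.0 50.0 0.0
```

Under these assumptions the average decays gradually and only reaches zero 
once the whole window is filled with zero samples, which is consistent with 
the half-hour decline described in the report.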



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
