[
https://issues.apache.org/jira/browse/FLINK-39925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18089683#comment-18089683
]
Trystan commented on FLINK-39925:
---------------------------------
Thanks for taking a look at this, Swati. In case it helps, I have a bit more
context to add from further digging.
On a job currently stuck in this state, the Flink UI itself is also returning
zeros for metrics, on every refresh. The UI shows `loading...` and everything
else is empty. I'm including the whole (somewhat anonymized) payload below in
case it is helpful. So I think the problem may be two-fold. First, the JM is
unable to collect these metrics, and I'm not sure why yet; I will do some more
experimentation today. Prometheus is perfectly able to scrape and show
metrics, so there's something I don't yet understand about how the JM accesses
or serves them.
Second, the autoscaler is seeing _apparently_ valid zeros. From its point of
view the metrics are not unavailable, they're just zeros.
{code:java}
GET flinkui/jobs/<job_id>
{
"jid": "<job_id>",
"name": "<job_name>",
"isStoppable": false,
"state": "RUNNING",
"start-time": 1776097725815,
"end-time": -1,
"duration": 5609170974,
"maxParallelism": -1,
"now": 1781706896789,
"timestamps": {
"RUNNING": 1781653433979,
"SUSPENDED": 0,
"CANCELLING": 0,
"CANCELED": 0,
"RESTARTING": 0,
"RECONCILING": 0,
"INITIALIZING": 1776097725815,
"FAILED": 0,
"FAILING": 0,
"FINISHED": 0,
"CREATED": 1781653433967
},
"vertices": [
{
"id": "<vertex_a>",
"name": "<vertex_a_name>",
"maxParallelism": 128,
"parallelism": 18,
"status": "RUNNING",
"start-time": 1781653433981,
"end-time": -1,
"duration": 53462808,
"tasks": {
"SCHEDULED": 0,
"FAILED": 0,
"RUNNING": 18,
"CANCELED": 0,
"DEPLOYING": 0,
"FINISHED": 0,
"CANCELING": 0,
"RECONCILING": 0,
"CREATED": 0,
"INITIALIZING": 0
},
"metrics": {
"read-bytes": 0,
"read-bytes-complete": false,
"write-bytes": 0,
"write-bytes-complete": false,
"read-records": 0,
"read-records-complete": false,
"write-records": 0,
"write-records-complete": false,
"accumulated-backpressured-time": 0,
"accumulated-idle-time": 0,
"accumulated-busy-time": 0.0
}
},
{
"id": "<vertex_b>",
"name": "<vertex_b_name>",
"maxParallelism": 128,
"parallelism": 20,
"status": "RUNNING",
"start-time": 1781653433983,
"end-time": -1,
"duration": 53462806,
"tasks": {
"SCHEDULED": 0,
"FAILED": 0,
"RUNNING": 20,
"CANCELED": 0,
"DEPLOYING": 0,
"FINISHED": 0,
"CANCELING": 0,
"RECONCILING": 0,
"CREATED": 0,
"INITIALIZING": 0
},
"metrics": {
"read-bytes": 0,
"read-bytes-complete": false,
"write-bytes": 0,
"write-bytes-complete": false,
"read-records": 0,
"read-records-complete": false,
"write-records": 0,
"write-records-complete": false,
"accumulated-backpressured-time": 0,
"accumulated-idle-time": 0,
"accumulated-busy-time": 0.0
}
}
],
"status-counts": {
"SCHEDULED": 0,
"FAILED": 0,
"RUNNING": 2,
"CANCELED": 0,
"DEPLOYING": 0,
"FINISHED": 0,
"CANCELING": 0,
"RECONCILING": 0,
"CREATED": 0,
"INITIALIZING": 0
},
"plan": {
"jid": "<job_id>",
"name": "<job_name>",
"type": "STREAMING",
"nodes": [
{
"id": "<vertex_a_id>",
"parallelism": 18,
"operator": "",
"operator_strategy": "",
"description": "<description>",
"optimizer_properties": {}
},
{
"id": "<vertex_b_id>",
"parallelism": 20,
"operator": "",
"operator_strategy": "",
"description": "<description>",
"inputs": [
{
"num": 0,
"id": "<vertex_a_id>",
"ship_strategy": "HASH",
"exchange": "pipelined_bounded"
}
],
"optimizer_properties": {}
}
]
}
}{code}
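The pattern in the payload above (every vertex RUNNING, every I/O counter at
zero, every `*-complete` flag false) can be checked for mechanically. The
following is a minimal Python sketch, not part of Flink or the operator: the
function name `suspect_vertices` and the `COUNTERS` tuple are hypothetical, and
it assumes the `*-complete` flags mean what their names suggest, namely that
the aggregate covers every subtask. The sample is a trimmed copy of the payload
above.

```python
# Hypothetical helper: flag vertices whose zeros look like a collection gap.
COUNTERS = ("read-bytes", "write-bytes", "read-records", "write-records")


def suspect_vertices(job_detail):
    """Return ids of RUNNING vertices whose I/O counters are all zero
    and none of whose counters is marked complete."""
    suspects = []
    for vertex in job_detail.get("vertices", []):
        if vertex.get("status") != "RUNNING":
            continue
        metrics = vertex.get("metrics", {})
        all_zero = all(metrics.get(name, 0) == 0 for name in COUNTERS)
        # Assumption: "<counter>-complete" is True only when every subtask reported.
        none_complete = not any(
            metrics.get(name + "-complete", False) for name in COUNTERS
        )
        if all_zero and none_complete:
            suspects.append(vertex.get("id"))
    return suspects


# Zeroed, incomplete metrics exactly as they appear in the payload above.
ZEROED = {
    "read-bytes": 0, "read-bytes-complete": False,
    "write-bytes": 0, "write-bytes-complete": False,
    "read-records": 0, "read-records-complete": False,
    "write-records": 0, "write-records-complete": False,
}

# Trimmed copy of the job detail payload (only the fields the check reads).
SAMPLE = {
    "state": "RUNNING",
    "vertices": [
        {"id": "<vertex_a>", "status": "RUNNING", "metrics": dict(ZEROED)},
        {"id": "<vertex_b>", "status": "RUNNING", "metrics": dict(ZEROED)},
    ],
}

print(suspect_vertices(SAMPLE))
```

Both vertices in the sample are flagged. A RUNNING vertex with all counters at
zero and none marked complete looks more like a gap in metric collection than a
genuinely idle operator, which is the distinction the autoscaler does not
appear to be making here.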
> Job throughput metrics incorrectly dropping to zero, forcing scale down
> -----------------------------------------------------------------------
>
> Key: FLINK-39925
> URL: https://issues.apache.org/jira/browse/FLINK-39925
> Project: Flink
> Issue Type: Bug
> Components: Autoscaler, Kubernetes Operator
> Affects Versions: kubernetes-operator-1.14.0
> Reporter: Trystan
> Assignee: Swati Gupta
> Priority: Major
> Labels: pull-request-available
> Attachments: Screenshot 2026-06-12 at 1.34.43 PM.png, Screenshot
> 2026-06-12 at 1.46.26 PM.png
>
>
> Over the last few days I have noticed that the autoscaler will start somehow
> collecting zeros for throughput metrics. The values drop to zero over the
> course of about half an hour. This causes the autoscaler to continue scaling
> down even when it should not. The busy percentage is still very high, but the
> operator seems to no longer be taking this into account.
> We are using more or less all the default operator config values (not helm
> defaults, but actual operator defaults). `job.autoscaler.metrics.window` is
> 30m for each job, which matches the time when values finally drop to zero.
> Redeploying the job resets the metrics and the values are populated correctly.
> We recently upgraded the operator from 1.9.0 to 1.14.0. We are running Flink
> 1.18.1.
> Around the same time, we see logs indicating the output ratio between edges
> dropping to zero:
> {code:java}
> Computed output ratio for edge (a -> b) : 70.00000000033906"
> Computed output ratio for edge (a -> b) : 29.500000000536925"
> Computed output ratio for edge (a -> b) : 24.49999999973847"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0"
> Computed output ratio for edge (a -> b) : 0.0" {code}
> (In the Scaling Bounds screenshot, the yellow line is
> *AutoScaler_jobVertexID_TRUE_PROCESSING_RATE_Average*, while the
> blue bounds are *AutoScaler_jobVertexID_SCALE_UP_RATE_THRESHOLD_Current*
> and *AutoScaler_jobVertexID_SCALE_DOWN_RATE_THRESHOLD_Current*.)