fdc91 opened a new pull request, #26204:
URL: https://github.com/apache/flink/pull/26204
## What is the purpose of the change
This PR addresses a performance regression in the MetricStore that affects
clients fetching metrics, such as the autoscaler or the web UI. The /metrics
endpoint can become unresponsive because removing transient metrics for
completed subtasks is slow. This cleanup runs synchronously during metric
retrieval, so requests slow down significantly, particularly when the JM holds
multiple jobs or subtasks in a terminal state. The delays prevent timely metric
fetching and disrupt latency-sensitive systems like the autoscaler. Flamegraph
analysis identified the root cause as the inefficient synchronous cleanup
routine introduced with FLINK-31650.
## Brief change log
- Optimized the metrics cleanup process in `MetricStore` by caching the
names of transient metrics when they are first stored
- Improved metric removal efficiency by executing the cleanup routine only
once
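
The idea behind the two items above can be illustrated with a minimal sketch. It is not the code in this PR: the class `SubtaskMetricsSketch`, its methods, and the suffix list are hypothetical names chosen for illustration. It assumes transient metrics can be recognized by name at insertion time, so cleanup only touches the cached keys instead of scanning every stored metric, and a repeated cleanup call finds an empty cache and does no work.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-subtask metric holder; NOT Flink's actual MetricStore code. */
final class SubtaskMetricsSketch {
    // Assumption: transient metrics are identified by these name suffixes (illustrative only).
    private static final Set<String> TRANSIENT_SUFFIXES =
            Set.of("busyTimeMsPerSecond", "idleTimeMsPerSecond");

    private final Map<String, String> metrics = new HashMap<>();
    // Names of transient metrics, recorded once when the metric is first stored.
    private final Set<String> transientKeys = new HashSet<>();

    void put(String name, String value) {
        metrics.put(name, value);
        for (String suffix : TRANSIENT_SUFFIXES) {
            if (name.endsWith(suffix)) {
                transientKeys.add(name);
                break;
            }
        }
    }

    /**
     * Removes the cached transient metrics and returns how many were removed.
     * The cache is cleared afterwards, so a second call is a cheap no-op.
     */
    int removeTransientMetrics() {
        int removed = 0;
        for (String key : transientKeys) {
            if (metrics.remove(key) != null) {
                removed++;
            }
        }
        transientKeys.clear();
        return removed;
    }

    int size() {
        return metrics.size();
    }

    boolean contains(String name) {
        return metrics.containsKey(name);
    }
}
```

With this layout the cost of a cleanup is proportional to the number of transient metrics of the subtask, not to the total number of metrics it holds.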
## Verifying this change
This change relies on the unit tests added in https://github.com/apache/flink/pull/23988
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable