yunfengzhou-hub opened a new pull request, #26951: URL: https://github.com/apache/flink/pull/26951
## What is the purpose of the change

This PR reduces the latency of the Flink REST handlers that generate the job DAG in the Flink UI.

In the current implementation, REST handlers such as `JobDetailsHandler` iterate over all vertices of a job and invoke `MetricStore#getSubtaskAttemptMetricStore` in each iteration. Because this method is synchronized, each invocation may block until other threads finish executing other synchronized methods on the same `MetricStore`. The blocking overhead accumulates across the loop, resulting in high latency when the Flink UI renders the status of a job through `JobDetailsHandler`.

To solve this problem, this PR reduces the number of synchronized invocations in REST handlers. Each handler acquires a snapshot of the `MetricStore` jobs once, so the synchronization overhead is paid a single time, and then reuses that snapshot inside its loops. The snapshot is read-only, so accessing it does not need to be synchronized.

As a benchmark, we manually measured how long the Flink UI takes to display the DAG of a complex Flink job at our company. Before the optimization, the display took more than 1 minute. After the optimization, it took less than 10 seconds.

## Brief change log

- Introduce `MetricStore.MetricStoreJobs` to manage a snapshot of all jobs in the `MetricStore`. Unlike the original implementation, which operates on the `MetricStore` jobs directly, the new class does not need the `synchronized` keyword on its methods.

## Verifying this change

This change is covered by existing tests, such as `JobDetailsHandlerTest` and `MetricStoreTest`.
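For illustration, the snapshot idea from the purpose section can be sketched as follows. This is a minimal, self-contained sketch and not the code in this PR: `SimpleMetricStore`, `getSubtaskMetric`, `snapshotJobs`, and the lock counter are hypothetical names, and copying the nested maps is an assumption made here to keep the snapshot immutable.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a simplified stand-in, NOT Flink's actual MetricStore. */
class SnapshotSketch {

    /** Hypothetical store whose methods are guarded by its intrinsic lock. */
    static class SimpleMetricStore {
        private final Map<String, Map<Integer, String>> jobs = new HashMap<>();
        private int lockAcquisitions = 0; // counts synchronized reads, for illustration

        synchronized void put(String jobId, int subtask, String metric) {
            jobs.computeIfAbsent(jobId, k -> new HashMap<>()).put(subtask, metric);
        }

        /** Per-lookup access: every call has to take the lock. */
        synchronized String getSubtaskMetric(String jobId, int subtask) {
            lockAcquisitions++;
            Map<Integer, String> job = jobs.get(jobId);
            return job == null ? null : job.get(subtask);
        }

        /** Snapshot access: takes the lock once and returns an immutable copy. */
        synchronized Map<String, Map<Integer, String>> snapshotJobs() {
            lockAcquisitions++;
            Map<String, Map<Integer, String>> copy = new HashMap<>();
            jobs.forEach((id, subtasks) -> copy.put(id, Map.copyOf(subtasks)));
            return Map.copyOf(copy);
        }

        synchronized int lockAcquisitions() {
            return lockAcquisitions;
        }
    }

    /** Old pattern: one synchronized call per loop iteration. */
    static int readPerLookup(SimpleMetricStore store, String jobId, int subtasks) {
        int found = 0;
        for (int i = 0; i < subtasks; i++) {
            if (store.getSubtaskMetric(jobId, i) != null) {
                found++;
            }
        }
        return found;
    }

    /** New pattern: one synchronized call, then lock-free reads of the snapshot. */
    static int readFromSnapshot(SimpleMetricStore store, String jobId, int subtasks) {
        Map<Integer, String> job = store.snapshotJobs().getOrDefault(jobId, Map.of());
        int found = 0;
        for (int i = 0; i < subtasks; i++) {
            if (job.get(i) != null) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        SimpleMetricStore store = new SimpleMetricStore();
        for (int i = 0; i < 100; i++) {
            store.put("job-1", i, "metric-" + i);
        }
        readPerLookup(store, "job-1", 100);
        int afterLoop = store.lockAcquisitions();
        readFromSnapshot(store, "job-1", 100);
        int afterSnapshot = store.lockAcquisitions();
        System.out.println("per-lookup lock acquisitions: " + afterLoop);
        System.out.println("snapshot lock acquisitions: " + (afterSnapshot - afterLoop));
    }
}
```

In this sketch the per-lookup loop enters the lock once per subtask, while the snapshot loop enters it once in total. The trade-off is that a snapshot may be slightly stale by the time the loop finishes, which is usually acceptable for rendering a UI view.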
## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no

## Documentation

- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable