yunfengzhou-hub opened a new pull request, #26951:
URL: https://github.com/apache/flink/pull/26951

   ## What is the purpose of the change
   
   This PR reduces the latency of the Flink REST handlers used to generate the 
DAG in the Flink UI.
   
   In the current implementation, REST handlers such as `JobDetailsHandler` 
iterate over all vertices of a job and invoke 
`MetricStore#getSubtaskAttemptMetricStore` in each iteration. Because this is a 
synchronized method, each invocation may block until other threads finish 
executing other synchronized methods. This blocking overhead accumulates 
across the for loop, resulting in high latency when the Flink UI renders the 
status of a Flink job through `JobDetailsHandler`.
   
   To solve this problem, this PR reduces the number of synchronized 
invocations in REST handlers. Each handler acquires a snapshot of the 
MetricStore jobs once (so the synchronization overhead is paid only once), and 
then reuses that snapshot inside its for loops. The snapshot is read-only, so 
it need not be synchronized.
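   
   The difference between the two access patterns can be sketched as follows. 
This is a minimal illustration, not the code in this PR: `SketchMetricStore`, 
`SnapshotSketch`, and all of their methods are hypothetical stand-ins, and 
copying the maps is just one assumed way to obtain a read-only snapshot.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a metric store; NOT Flink's actual MetricStore. */
class SketchMetricStore {
    // jobId -> (vertexId -> metric value); guarded by the monitor of "this".
    private final Map<String, Map<String, Long>> jobs = new HashMap<>();
    // Counts how often one of the two read paths below entered the monitor.
    private int lockAcquisitions = 0;

    synchronized void put(String jobId, String vertexId, long value) {
        jobs.computeIfAbsent(jobId, k -> new HashMap<>()).put(vertexId, value);
    }

    /** Per-lookup access: enters the monitor on every call. */
    synchronized Long getVertexMetric(String jobId, String vertexId) {
        lockAcquisitions++;
        Map<String, Long> vertices = jobs.get(jobId);
        return vertices == null ? null : vertices.get(vertexId);
    }

    /** Snapshot access: enters the monitor once and returns a read-only copy. */
    synchronized Map<String, Map<String, Long>> snapshotJobs() {
        lockAcquisitions++;
        Map<String, Map<String, Long>> copy = new HashMap<>();
        for (Map.Entry<String, Map<String, Long>> e : jobs.entrySet()) {
            copy.put(e.getKey(), Collections.unmodifiableMap(new HashMap<>(e.getValue())));
        }
        return Collections.unmodifiableMap(copy);
    }

    synchronized int lockAcquisitions() {
        return lockAcquisitions;
    }
}

class SnapshotSketch {
    /** One synchronized call per vertex: contention cost grows with the loop. */
    static long sumPerLookup(SketchMetricStore store, String jobId, List<String> vertexIds) {
        long sum = 0;
        for (String vertexId : vertexIds) {
            Long value = store.getVertexMetric(jobId, vertexId);
            if (value != null) {
                sum += value;
            }
        }
        return sum;
    }

    /** One synchronized call in total: the loop reads only the snapshot. */
    static long sumWithSnapshot(SketchMetricStore store, String jobId, List<String> vertexIds) {
        Map<String, Long> vertices =
                store.snapshotJobs().getOrDefault(jobId, Collections.emptyMap());
        long sum = 0;
        for (String vertexId : vertexIds) {
            Long value = vertices.get(vertexId);
            if (value != null) {
                sum += value;
            }
        }
        return sum;
    }
}
```

   In this sketch, both methods compute the same sum, but `sumPerLookup` 
enters the store's monitor once per vertex while `sumWithSnapshot` enters it 
once per request, which is the effect described above.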
   
   As for benchmark results, we manually measured the time the Flink UI takes 
to display the DAG of a complex Flink job in our company. Before the 
optimization, the Flink UI needed more than 1 minute to finish the display. 
After the optimization, the latency decreased to less than 10 seconds.
   
   ## Brief change log
   
   - Introduce `MetricStore.MetricStoreJobs` to manage a snapshot of all jobs 
in the MetricStore. Unlike the original implementation for operating on 
MetricStore jobs, the new implementation does not need the `synchronized` 
keyword on its methods.
   
   ## Verifying this change
   
   The correctness of this PR is covered by existing tests, such as 
JobDetailsHandlerTest and MetricStoreTest.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   

