featzhang created FLINK-39489:
---------------------------------

             Summary: [web dashboard] Add Top N Metrics Dashboard to Flink Web UI
                 Key: FLINK-39489
                 URL: https://issues.apache.org/jira/browse/FLINK-39489
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Web Frontend
            Reporter: featzhang


h2. Background

Engineers troubleshooting a running Flink job currently have to click
through many subtask-level pages in the Web UI to locate resource
hotspots (high CPU, backpressure, GC). There is no single place that
ranks tasks/operators by these signals.

h2. Goal

Provide a *Top N Metrics Dashboard* in the Flink Web UI that, for a
given job, lists the most resource-intensive components across three
dimensions:
 * Top N CPU consumers — tasks with the highest CPU usage
 * Top N backpressured operators — operators with the highest
backpressure ratio
 * Top N GC-intensive tasks — tasks with the highest GC overhead

h2. Proposed Solution
 * Add a new REST endpoint {{GET /jobs/:jobid/metrics/top-n}} served by a
new {{TopNMetricsHandler}} (extends {{AbstractRestHandler}}), wired
into {{WebMonitorEndpoint}}.
 * Aggregate metrics via the *public* {{MetricStore}} API
({{getRepresentativeAttempts()}},
{{getAllSubtaskMetricStores()}}), with no reliance on package-private
state.
 * Response body lives under
{{org.apache.flink.runtime.rest.messages.job.metrics}} and returns
three ranked sections (CPU / Backpressure / GC), default N = 5.
 * Add an Angular {{TopNMetricsComponent}} + {{TopNMetricsService}} in
{{flink-runtime-web}} that renders the three Top N lists and an
optional {{/overview-demo}} entry point.
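
The ranking step behind each of the three sections can be sketched as a bounded min-heap selection. This is an illustration only, not the code from the PR: the class {{TopNRanker}}, its {{topN}} method, and the flat name-to-value input map are hypothetical, and the sketch does not touch {{MetricStore}} at all. It assumes the handler has already reduced each task or operator to a single numeric value per dimension.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper for illustration; not part of Flink or the PR. */
final class TopNRanker {

    /**
     * Returns the n entries with the largest values, highest first.
     * Keeps a min-heap capped at n elements, so m inputs cost O(m log n).
     */
    static List<Map.Entry<String, Double>> topN(Map<String, Double> valuesByName, int n) {
        if (n <= 0) {
            return List.of();
        }
        Comparator<Map.Entry<String, Double>> byValue = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(byValue);
        for (Map.Entry<String, Double> e : valuesByName.entrySet()) {
            Double value = e.getValue();
            if (value == null || value.isNaN()) {
                continue; // skip tasks whose metric has not been reported yet
            }
            heap.offer(Map.entry(e.getKey(), value));
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest so only the top n remain
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(heap);
        ranked.sort(byValue.reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up sample values; real input would come from the metric store.
        Map<String, Double> cpuByTask =
                Map.of("Source", 0.12, "Map", 0.87, "Window", 0.55, "Sink", 0.31);
        System.out.println(topN(cpuByTask, 2));
    }
}
```

With the default N = 5 the heap never holds more than six entries at a time, so the cost per request stays close to linear in the number of subtasks.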

h2. Non-goals
 * No changes to how metrics are collected or cached.
 * No changes to existing REST endpoints or public APIs.
 * Historical/time-series ranking is out of scope for this issue.

h2. Verification
 # Build: {{mvn clean install -DskipTests}} and
{{npm run build}} in {{flink-runtime-web/web-dashboard}}.
 # Integration: run a sample job, call
{{GET /jobs/:jobid/metrics/top-n}} and assert the three sections
are populated.
 # UI: load the dashboard page in the Web UI and visually verify
the three Top N lists render with live data.
 # Add unit tests for {{TopNMetricsHandler}} aggregation logic.
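
For the integration step, a minimal Java client sketch could look like the following. It uses only the JDK's {{java.net.http}} API. The class name {{TopNRequestSketch}} and the base URL {{http://localhost:8081}} (the default REST port of a local cluster) are assumptions, and because this ticket does not spell out the response field names, the sketch only checks the HTTP status and prints the raw body.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical smoke-test client; not part of Flink or the PR. */
final class TopNRequestSketch {

    /** Builds a GET request for the proposed /jobs/:jobid/metrics/top-n endpoint. */
    static HttpRequest topNRequest(String restBaseUrl, String jobId) {
        URI uri = URI.create(restBaseUrl + "/jobs/" + jobId + "/metrics/top-n");
        return HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java TopNRequestSketch.java <jobId>");
            return;
        }
        // Assumed address of a locally running cluster; adjust as needed.
        HttpRequest request = topNRequest("http://localhost:8081", args[0]);
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("unexpected status " + response.statusCode());
        }
        // Inspect the body by hand for the CPU / Backpressure / GC sections.
        System.out.println(response.body());
    }
}
```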

h2. Notes

Previous attempts: [PR #27771|https://github.com/apache/flink/pull/27771] (dropped because of architectural issues) and
[PR #27773|https://github.com/apache/flink/pull/27773] (dropped).
The replacement PR is [PR #27774|https://github.com/apache/flink/pull/27774],
which currently references an incorrect JIRA id ({{FLINK-27773}}, which
belongs to "Introduce the E2E tests for SQL Gateway") and must be re-linked to
this new ticket.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
