featzhang created FLINK-39489:
---------------------------------
Summary: [web dashboard] Add Top N Metrics Dashboard to Flink Web UI
Key: FLINK-39489
URL: https://issues.apache.org/jira/browse/FLINK-39489
Project: Flink
Issue Type: New Feature
Components: Runtime / Web Frontend
Reporter: featzhang
h2. Background
Users troubleshooting a running Flink job currently have to navigate
through many subtask-level pages in the Web UI to locate resource
hotspots (high CPU, backpressure, GC). There is no single place that
ranks tasks/operators by these signals.
h2. Goal
Provide a *Top N Metrics Dashboard* in the Flink Web UI that, for a
given job, lists the most resource-intensive components across three
dimensions:
* Top N CPU consumers — tasks with the highest CPU usage
* Top N backpressured operators — operators with the highest
backpressure ratio
* Top N GC-intensive tasks — tasks with the highest GC overhead
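The ranking in each dimension amounts to a "keep the N largest values" selection. The following is a minimal sketch of that idea, not code from the PR: the class {{TopNRanking}} and its method are hypothetical, and the input is assumed to be a plain map from a task or operator id to an already-aggregated metric value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper (not a Flink class): selects the N entries with the largest value. */
final class TopNRanking {

    private TopNRanking() {}

    /**
     * Returns up to n entries of valuesById, ordered from highest to lowest value.
     * A bounded min-heap keeps the cost at O(m log n) for m candidates.
     * The relative order of entries with equal values is unspecified.
     */
    static List<Map.Entry<String, Double>> topN(Map<String, Double> valuesById, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        Comparator<Map.Entry<String, Double>> byValue = Map.Entry.comparingByValue();
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(byValue);
        for (Map.Entry<String, Double> entry : valuesById.entrySet()) {
            // Copy the entry so the result does not depend on the caller's map.
            heap.offer(Map.entry(entry.getKey(), entry.getValue()));
            if (heap.size() > n) {
                heap.poll(); // drop the current smallest value
            }
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(heap);
        ranked.sort(byValue.reversed());
        return ranked;
    }
}
```

Under these assumptions, calling {{TopNRanking.topN(cpuByTask, 5)}} once per dimension would yield the three lists; a full sort would work equally well for the small number of tasks in a typical job.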
h2. Proposed Solution
* Add a new REST endpoint {{GET /jobs/:jobid/metrics/top-n}} served by a
new {{TopNMetricsHandler}} (extends {{AbstractRestHandler}}), wired
into {{WebMonitorEndpoint}}.
* Aggregate metrics via the *public* {{MetricStore}} API
({{getRepresentativeAttempts()}},
{{getAllSubtaskMetricStores()}}), with no reliance on package-private
state.
* The response body class lives in package
{{org.apache.flink.runtime.rest.messages.job.metrics}} and contains
three ranked sections (CPU / Backpressure / GC), default N = 5.
* Add an Angular {{TopNMetricsComponent}} + {{TopNMetricsService}} in
{{flink-runtime-web}} that renders the three Top N lists and an
optional {{/overview-demo}} entry point.
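To illustrate the aggregation step, here is a minimal sketch of how per-subtask values could be reduced to one value per task before ranking. It does not call the {{MetricStore}} API; the class {{SubtaskAggregation}} is hypothetical, and two things are assumptions rather than statements from this ticket: that the backpressure ratio is derived from Flink's {{backPressuredTimeMsPerSecond}} task metric, and that the per-task value is the maximum across subtasks.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Flink class): reduces subtask values to one value per task. */
final class SubtaskAggregation {

    private SubtaskAggregation() {}

    /** Converts a backPressuredTimeMsPerSecond reading (0..1000) into a ratio clamped to [0, 1]. */
    static double toRatio(double backPressuredMsPerSecond) {
        return Math.max(0.0, Math.min(1.0, backPressuredMsPerSecond / 1000.0));
    }

    /**
     * For each task id, returns the highest backpressure ratio observed among its subtasks.
     * Using the maximum (rather than the mean) is an assumption of this sketch.
     */
    static Map<String, Double> maxRatioPerTask(Map<String, List<Double>> subtaskValuesByTask) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> entry : subtaskValuesByTask.entrySet()) {
            double max = 0.0;
            for (double value : entry.getValue()) {
                max = Math.max(max, toRatio(value));
            }
            result.put(entry.getKey(), max);
        }
        return result;
    }
}
```

Taking the maximum highlights a single skewed subtask, which is often what a hotspot view is meant to surface; averaging would hide it. The actual handler may choose differently.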
h2. Non-goals
* No changes to how metrics are collected or cached.
* No changes to existing REST endpoints or public APIs.
* Historical/time-series ranking is out of scope for this issue.
h2. Verification
# Build: {{mvn clean install -DskipTests}} and
{{npm run build}} in {{flink-runtime-web/web-dashboard}}.
# Integration: run a sample job, call
{{GET /jobs/:jobid/metrics/top-n}} and assert the three sections
are populated.
# UI: load the dashboard page in the Web UI and visually verify
the three Top N lists render with live data.
# Add unit tests for {{TopNMetricsHandler}} aggregation logic.
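The integration step could be scripted roughly as follows. This is a sketch under assumptions: {{TopNEndpointCheck}} is a hypothetical class, the REST base URL (for example {{http://localhost:8081}} on a default local cluster) and the job id are supplied by the caller, and the JSON field names of the three sections are not specified in this ticket, so the check on the body is left to the caller.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical verification helper for the proposed top-n endpoint. */
final class TopNEndpointCheck {

    private TopNEndpointCheck() {}

    /** Builds the endpoint URI from the REST base URL and a job id. */
    static URI topNUri(String restBaseUrl, String jobId) {
        String base = restBaseUrl.endsWith("/")
                ? restBaseUrl.substring(0, restBaseUrl.length() - 1)
                : restBaseUrl;
        return URI.create(base + "/jobs/" + jobId + "/metrics/top-n");
    }

    /** Issues the GET request and returns the raw JSON body, failing on any non-200 status. */
    static String fetchTopN(String restBaseUrl, String jobId)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(topNUri(restBaseUrl, jobId)).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

A test would then parse the returned body with the project's JSON library and assert that each of the three sections is non-empty.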
h2. Notes
Previous attempts: [PR #27771|https://github.com/apache/flink/pull/27771]
(dropped because of architectural issues) and
[PR #27773|https://github.com/apache/flink/pull/27773] (dropped).
The replacement PR is [PR #27774|https://github.com/apache/flink/pull/27774],
which currently references an incorrect JIRA id ({{FLINK-27773}}, which
belongs to "Introduce the E2E tests for SQL Gateway") and must be re-linked to
this new ticket.