[ https://issues.apache.org/jira/browse/FLINK-39079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-39079:
-----------------------------------
Labels: pull-request-available (was: )
> Add Diagnostic Summary Page in Flink Web UI
> -------------------------------------------
>
> Key: FLINK-39079
> URL: https://issues.apache.org/jira/browse/FLINK-39079
> Project: Flink
> Issue Type: New Feature
> Components: Runtime / Web Frontend
> Reporter: featzhang
> Priority: Major
> Labels: pull-request-available
>
> Currently, troubleshooting Flink jobs requires navigating across multiple Web
> UI pages to collect diagnostic information:
> * *Checkpoints:* For checkpointing issues.
> * *Backpressure:* For operator bottlenecks.
> * *Task Managers:* For resource usage.
> * *Logs:* For error messages.
> * *Metrics:* For performance indicators.
> This fragmented approach is time-consuming and error-prone, as users must
> manually correlate data from different sources.
> h4. *Motivation*
> The proposed *Diagnostic Summary Page* aims to consolidate key diagnostic
> information into a single dashboard to:
> # Provide a unified view of job health at a glance.
> # Highlight critical issues with visual indicators.
> # Reduce diagnosis time from minutes to seconds.
> # Lower the learning curve for new users.
> h4. *Proposed Changes*
> *1. New "Diagnostics" Tab*
> * Add a new *"Diagnostics"* tab in the Job Overview page alongside existing
> tabs (Overview, Checkpoints, Backpressure, etc.).
> *2. Diagnostic Categories and Metrics*
> The page will include the following modules:
> * *Job Status Summary:* State (RUNNING/FAILED), duration, restart history,
> last failure timestamp, and error message.
> * *Checkpoint Health:* Status indicator (Healthy/Unhealthy), latest
> duration, alignment duration, failed count (last 10 mins), and trend charts.
> * *Backpressure Analysis:* List of operators with high backpressure (>80%),
> severity ranking (Top 10), and affected subtasks/TMs.
> * *Resource Utilization:* Top 10 CPU- and memory-intensive tasks, TMs with
> high GC frequency, and network throughput.
> * *Error Tracking:* Recent errors grouped by type, exception counts (last 5
> mins), and stack trace snippets.
> * *Alert Recommendations:* Auto-generated suggestions based on detected
> issues with links to documentation.
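
To make the Backpressure Analysis module concrete, the following is a minimal sketch of how the "high backpressure (>80%)" filter and the Top 10 severity ranking could be computed. The class names `OperatorBackpressure` and `BackpressureRanking` are hypothetical and do not exist in Flink; the ratio is assumed to be a value between 0.0 and 1.0.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical per-operator sample; not an existing Flink class. */
final class OperatorBackpressure {
    final String operatorName;
    final double backpressureRatio; // assumed range: 0.0 (idle) to 1.0 (fully backpressured)

    OperatorBackpressure(String operatorName, double backpressureRatio) {
        this.operatorName = operatorName;
        this.backpressureRatio = backpressureRatio;
    }
}

/** Hypothetical helper illustrating the ranking described in the proposal. */
final class BackpressureRanking {
    static final double HIGH_THRESHOLD = 0.80; // ">80%" from the proposal
    static final int TOP_N = 10;               // "Top 10" from the proposal

    /** Returns operators strictly above the threshold, most severe first, at most TOP_N. */
    static List<OperatorBackpressure> topOffenders(List<OperatorBackpressure> samples) {
        return samples.stream()
                .filter(s -> s.backpressureRatio > HIGH_THRESHOLD)
                .sorted(Comparator
                        .comparingDouble((OperatorBackpressure s) -> s.backpressureRatio)
                        .reversed())
                .limit(TOP_N)
                .collect(Collectors.toList());
    }
}
```

An operator at exactly 80% is not listed, because the proposal states the threshold as strictly greater than 80%.
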
> *3. UI/UX Design*
> * *Color-coded indicators:* Green (Healthy), Yellow (Warning), Red
> (Critical).
> * *Collapsible sections* for each category.
> * *Filtering & Sorting* for lists (e.g., by severity).
> * *Refresh button* for real-time updates.
> * *Export function* to save reports as JSON/HTML.
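
One possible way to model the color-coded indicators is an ordered enum, sketched below. The name `HealthLevel` and the "worst level wins" rule for rolling several modules into a single job-level indicator are assumptions for illustration; the proposal only specifies the three colors and their meanings.

```java
import java.util.Collection;

/** Hypothetical health level; declaration order encodes severity (least to most severe). */
enum HealthLevel {
    HEALTHY("green"),
    WARNING("yellow"),
    CRITICAL("red");

    final String color;

    HealthLevel(String color) {
        this.color = color;
    }

    /**
     * Returns the most severe level in the collection.
     * An empty collection yields HEALTHY (an assumption of this sketch).
     */
    static HealthLevel worstOf(Collection<HealthLevel> levels) {
        HealthLevel worst = HEALTHY;
        for (HealthLevel level : levels) {
            // Enum compareTo follows declaration order, so a larger value is more severe.
            if (level.compareTo(worst) > 0) {
                worst = level;
            }
        }
        return worst;
    }
}
```
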
> *4. Backend Changes*
> * *New REST Endpoint:* {{GET /jobs/:jobid/diagnostics}}
> * *New Handler:* {{JobDiagnosticsHandler}} to aggregate metrics from
> existing handlers.
> * *Caching:* Cache aggregated results so that repeated requests do not
> trigger redundant metric collection.
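
The caching item could be approached with a small per-job, time-to-live cache in front of the aggregation step, sketched below. `DiagnosticsCache`, its constructor parameters, and the TTL policy are hypothetical; the proposal does not specify a caching strategy, and this sketch does not use Flink's actual REST handler classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache keyed by job ID; V is whatever report type the handler returns. */
final class DiagnosticsCache<V> {

    private static final class Entry<V> {
        final V value;
        final long createdAtMillis;

        Entry(V value, long createdAtMillis) {
            this.value = value;
            this.createdAtMillis = createdAtMillis;
        }
    }

    private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;           // injectable so expiry can be tested
    private final Function<String, V> collector; // performs the expensive aggregation

    DiagnosticsCache(long ttlMillis, LongSupplier clock, Function<String, V> collector) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.collector = collector;
    }

    /** Returns the cached report if it is younger than the TTL, otherwise collects a new one. */
    V get(String jobId) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent requests for one job collect only once.
        Entry<V> entry = entries.compute(jobId, (id, old) ->
                (old != null && now - old.createdAtMillis < ttlMillis)
                        ? old
                        : new Entry<>(collector.apply(id), now));
        return entry.value;
    }
}
```

A handler for {{GET /jobs/:jobid/diagnostics}} could construct this with `System::currentTimeMillis` as the clock and call `get(jobId)` per request. Note that the collector runs while the map entry is locked, so a slow collection blocks other requests for the same job; that trade-off would need review in a real implementation.
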
> h4. *Alternatives Considered*
> # *Dashboard Extension:* Extending the existing Overview page was rejected
> to avoid clutter.
> # *CLI-based Diagnostics:* Rejected due to lower accessibility compared to
> the Web UI for operations teams.
> # *Third-party Integration:* Rejected as it adds operational complexity and
> excludes users without external monitoring tools.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)