[ 
https://issues.apache.org/jira/browse/FLINK-39079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39079:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add Diagnostic Summary Page in Flink Web UI
> -------------------------------------------
>
>                 Key: FLINK-39079
>                 URL: https://issues.apache.org/jira/browse/FLINK-39079
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Web Frontend
>            Reporter: featzhang
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, troubleshooting Flink jobs requires navigating across multiple Web 
> UI pages to collect diagnostic information:
>  * *Checkpoints:* For checkpointing issues.
>  * *Backpressure:* For operator bottlenecks.
>  * *Task Managers:* For resource usage.
>  * *Logs:* For error messages.
>  * *Metrics:* For performance indicators.
> This fragmented approach is time-consuming and error-prone, as users must 
> manually correlate data from different sources.
> h4. *Motivation*
> The proposed *Diagnostic Summary Page* aims to consolidate key diagnostic 
> information into a single dashboard to:
>  # Provide a unified view of job health at a glance.
>  # Highlight critical issues with visual indicators.
>  # Reduce diagnosis time from minutes to seconds.
>  # Lower the learning curve for new users.
> h4. *Proposed Changes*
> *1. New "Diagnostics" Tab*
>  * Add a new *"Diagnostics"* tab in the Job Overview page alongside existing 
> tabs (Overview, Checkpoints, Backpressure, etc.).
> *2. Diagnostic Categories and Metrics*
> The page will include the following modules:
>  * *Job Status Summary:* State (RUNNING/FAILED), duration, restart history, 
> last failure timestamp, and error message.
>  * *Checkpoint Health:* Status indicator (Healthy/Unhealthy), latest 
> duration, alignment duration, failed count (last 10 mins), and trend charts.
>  * *Backpressure Analysis:* List of operators with high backpressure (>80%), 
> severity ranking (Top 10), and affected subtasks/TMs.
>  * *Resource Utilization:* Top 10 CPU- and memory-intensive tasks, TMs with 
> high GC frequency, and network throughput.
>  * *Error Tracking:* Recent errors grouped by type, exception counts (last 5 
> mins), and stack trace snippets.
>  * *Alert Recommendations:* Auto-generated suggestions based on detected 
> issues with links to documentation.
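The Backpressure Analysis module above (operators above 80%, ranked, Top 10) could be sketched roughly as follows. This is only an illustration: the class `BackpressureRanking`, the holder type `OperatorBackpressure`, and the representation of backpressure as a ratio between 0.0 and 1.0 are assumptions made for the example and are not taken from the issue or from existing Flink code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Flink: selects the most backpressured operators.
class BackpressureRanking {

    // Hypothetical value holder: one operator and its backpressure ratio (0.0 to 1.0).
    static final class OperatorBackpressure {
        public final String operatorName;
        public final double ratio;

        public OperatorBackpressure(String operatorName, double ratio) {
            this.operatorName = operatorName;
            this.ratio = ratio;
        }
    }

    // Threshold and list size as stated in the proposal (>80%, Top 10).
    static final double HIGH_THRESHOLD = 0.80;
    static final int MAX_ENTRIES = 10;

    // Keeps operators strictly above the threshold, most severe first, at most 10.
    public static List<OperatorBackpressure> topBackpressured(List<OperatorBackpressure> all) {
        return all.stream()
                .filter(op -> op.ratio > HIGH_THRESHOLD)
                .sorted(Comparator
                        .comparingDouble((OperatorBackpressure op) -> op.ratio)
                        .reversed())
                .limit(MAX_ENTRIES)
                .collect(Collectors.toList());
    }
}
```

Given operators at 50%, 95%, and 85%, only the last two would be returned, with the 95% operator first.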
> *3. UI/UX Design*
>  * *Color-coded indicators:* Green (Healthy), Yellow (Warning), Red 
> (Critical).
>  * *Collapsible sections* for each category.
>  * *Filtering & Sorting* for lists (e.g., by severity).
>  * *Refresh button* for real-time updates.
>  * *Export function* to save reports as JSON/HTML.
> *4. Backend Changes*
>  * *New REST Endpoint:* {{GET /jobs/:jobid/diagnostics}}
>  * *New Handler:* {{JobDiagnosticsHandler}} to aggregate metrics from 
> existing handlers.
>  * *Caching:* Cache the aggregated result for a short interval so that 
> repeated requests do not re-collect the same metrics.
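One possible shape for the caching item above is a small per-job cache with a time-to-live, so that repeated calls to the diagnostics endpoint within the interval reuse the last aggregated result. The sketch below is an assumption-laden illustration: `DiagnosticsCache`, its constructor parameters, and the idea of keying by job ID string are hypothetical, and the loader function stands in for whatever aggregation `JobDiagnosticsHandler` would perform.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical TTL cache, not part of Flink: one cached diagnostics result per job ID.
class DiagnosticsCache<V> {

    // A cached value together with the time it was loaded.
    private static final class Entry<T> {
        final T value;
        final long loadedAtMillis;

        Entry(T value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;          // injectable clock, e.g. System::currentTimeMillis
    private final Function<String, V> loader;  // stands in for the real metric aggregation

    public DiagnosticsCache(long ttlMillis, LongSupplier clock, Function<String, V> loader) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.loader = loader;
    }

    // Returns the cached result if it is younger than the TTL, otherwise reloads it.
    // compute() is atomic per key, so concurrent requests for one job trigger one load.
    public V get(String jobId) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(jobId, (id, old) ->
                (old != null && now - old.loadedAtMillis < ttlMillis)
                        ? old
                        : new Entry<>(loader.apply(id), now));
        return entry.value;
    }
}
```

Because the loader runs inside `compute`, it must not touch the same cache. A real handler would probably also need to evict entries for finished jobs, which this sketch leaves out.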
> h4. *Alternatives Considered*
>  # *Dashboard Extension:* Extending the existing Overview page was rejected 
> to avoid clutter.
>  # *CLI-based Diagnostics:* Rejected due to lower accessibility compared to 
> the Web UI for operations teams.
>  # *Third-party Integration:* Rejected as it adds operational complexity and 
> excludes users without external monitoring tools.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
