featzhang created FLINK-39079:
---------------------------------
Summary: Add Diagnostic Summary Page in Flink Web UI
Key: FLINK-39079
URL: https://issues.apache.org/jira/browse/FLINK-39079
Project: Flink
Issue Type: New Feature
Components: Runtime / Web Frontend
Reporter: featzhang
Currently, when troubleshooting Flink jobs, users need to navigate across
multiple pages in the Web UI to collect diagnostic information:
- Check checkpoints page for checkpointing issues
- View backpressure page for operator bottlenecks
- Monitor task managers for resource usage
- Review logs for error messages
- Check metrics dashboard for performance indicators
This fragmented approach makes identifying the root cause of job problems
time-consuming and error-prone. Users often have to manually correlate
information from different sources to understand the overall health of their
jobs.
**Motivation:**
The proposed Diagnostic Summary Page will consolidate key diagnostic
information into a single, easily accessible dashboard. This will significantly
improve operational efficiency by:
- Providing a unified view of job health status at a glance
- Highlighting the most critical issues with visual indicators
- Reducing the time required to diagnose problems from minutes to seconds
- Enabling faster incident response and reduced downtime
- Lowering the learning curve for new users by presenting information in a
structured way
**Proposed Changes:**
1. **Add a new "Diagnostics" tab** on the Job Overview page, positioned
alongside existing tabs (Overview, Checkpoints, Backpressure, etc.)
2. **Diagnostic Categories and Metrics:**
a. **Job Status Summary**
- Job state (RUNNING, FAILED, CANCELED, etc.)
- Job duration and restart history
- Last failure timestamp and error message (if applicable)
b. **Checkpoint Health**
- Checkpoint status indicator (Healthy/Unhealthy)
- Latest checkpoint duration
- Checkpoint alignment duration
- Failed checkpoint count in the last 10 minutes
- Trend chart showing checkpoint times over the job lifecycle
c. **Backpressure Analysis**
- List of operators with high backpressure (> 80%)
- Backpressure severity ranking (Top 10)
- Affected subtasks and task managers
d. **Resource Utilization**
- Top 10 CPU-intensive tasks
- Top 10 memory-intensive tasks
- Task managers with high GC frequency
- Network throughput per connection
e. **Error Tracking**
- Recent error messages grouped by type
- Count of exceptions in the last 5 minutes
- Stack trace snippets for most frequent errors
f. **Alert Recommendations**
- Auto-generated suggestions based on detected issues
- Links to relevant documentation or configuration options
3. **UI/UX Design:**
- Use color-coded status indicators (Green=Healthy, Yellow=Warning,
Red=Critical)
- Implement collapsible sections for each diagnostic category
- Support filtering and sorting for lists (e.g., by severity, timestamp)
- Include a "Refresh" button to update real-time metrics
- Export diagnostic report as a JSON file
4. **Backend Changes:**
- Add REST endpoint: `GET /jobs/:jobid/diagnostics`
- Create `JobDiagnosticsHandler` to aggregate metrics from existing handlers
- Implement efficient caching to avoid redundant metric collection
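To make the backpressure ranking (item 2c) and the caching idea (item 4) more concrete, here is a minimal sketch in Python. It is illustrative only: the actual handler would presumably be written in Java inside Flink's REST layer, and every name below (`OperatorBackpressure`, `rank_backpressured`, `TtlCache`, the 0.0 to 1.0 ratio scale, the TTL value) is a hypothetical assumption, not an existing Flink API. Only the 80% threshold and the Top 10 cap come from the proposal.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Values taken from the proposal text; the constant names are hypothetical.
BACKPRESSURE_THRESHOLD = 0.80  # "high backpressure (> 80%)"
TOP_N = 10                     # "Top 10"


@dataclass(frozen=True)
class OperatorBackpressure:
    """Hypothetical per-operator sample; ratio is assumed to be in 0.0..1.0."""
    operator: str
    ratio: float


def rank_backpressured(samples: List[OperatorBackpressure]) -> List[OperatorBackpressure]:
    """Keep operators strictly above the threshold, worst first, capped at TOP_N."""
    high = [s for s in samples if s.ratio > BACKPRESSURE_THRESHOLD]
    return sorted(high, key=lambda s: s.ratio, reverse=True)[:TOP_N]


class TtlCache:
    """Per-job cache so repeated page refreshes reuse one aggregation result."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get_or_compute(self, job_id: str, compute: Callable[[], dict]) -> dict:
        """Return the cached value for job_id if still fresh, else recompute it."""
        now = self._clock()
        cached = self._entries.get(job_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        value = compute()
        self._entries[job_id] = (now, value)
        return value
```

In this sketch a handler for `GET /jobs/:jobid/diagnostics` would call `get_or_compute(job_id, ...)` with a function that gathers the per-category data, so that several users refreshing the page within the TTL trigger only one aggregation pass. The right TTL, and whether the cache should be per job or per category, are open design questions.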
**Alternatives Considered:**
1. **Dashboard Extension**: Instead of a dedicated diagnostics page, extend the
existing Overview page. Rejected because it would make the Overview page
cluttered and less focused on high-level job information.
2. **CLI-based Diagnostics**: Provide a command-line tool to export diagnostic
information. Rejected because the Web UI is more accessible to a broader range
of users, especially those responsible for monitoring and operations.
3. **Third-party Integration**: Rely on external monitoring tools (e.g.,
Prometheus, Grafana). Rejected because it adds operational complexity and
doesn't help users who don't have such tools already set up.
**Additional Context:**
- Target Version: 1.21
- Component: Web Frontend / Runtime / REST
- Priority: Major
- Labels: web-ui, diagnostics, usability
This feature builds upon existing Web UI improvements such as the Top N Metrics
dashboard and aligns with Flink's ongoing efforts to improve observability and
operational experience.
**Related Issues:**
- FLINK-XXXXX: Add Top N Metrics Dashboard (already implemented)
- FLINK-XXXXX: Improve exception messages