[
https://issues.apache.org/jira/browse/FLINK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-38584:
-----------------------------------
Labels: pull-request-available (was: )
> Support checkpoint external path as Prometheus info-style metric
> ----------------------------------------------------------------
>
> Key: FLINK-38584
> URL: https://issues.apache.org/jira/browse/FLINK-38584
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Metrics
> Affects Versions: 1.17.0, 1.13
> Reporter: sohurdc
> Priority: Major
> Labels: pull-request-available
>
> h2. Problem Statement
> Currently, the lastCheckpointExternalPath metric in Flink is exported as a
> Gauge with the checkpoint path as its string value. This approach has several
> limitations:
> # Incompatible with Prometheus/VictoriaMetrics: Time-series databases like
> Prometheus and VictoriaMetrics only support numeric values, making it
> impossible to store checkpoint paths without using additional storage
> solutions like InfluxDB.
> # Limited Observability: Users cannot easily correlate checkpoint paths with
> other checkpoint metrics (size, duration, etc.) in their monitoring
> dashboards.
> # Workaround Required: Currently, users need to set up separate storage
> systems (e.g., InfluxDB) just to track checkpoint paths, increasing
> operational complexity.
> h2. Proposed Solution
> Export lastCheckpointExternalPath as a Prometheus info-style metric:
> * Metric name: {{lastCheckpointExternalPath_info}}
> * Value: always 1.0 (following Prometheus convention)
> * Checkpoint path: stored in a {{path}} label
> This approach:
> * ✅ Is compatible with Prometheus/VictoriaMetrics
> * ✅ Follows Prometheus best practices for string-valued metrics (similar to
> node_uname_info)
> * ✅ Enables joining with other metrics via PromQL
> * ✅ Introduces no breaking changes to existing metrics
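
The three properties above (an _info suffix, a constant value of 1.0, and the path carried in a label) can be illustrated with a minimal sketch. This is not Flink's reporter code; it only shows what one exposition line of such a metric looks like. The function names and the sample label values (abc123, jm-0, the hdfs path) are made up for illustration.

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format.

    Backslash, newline and double quote are the characters that need escaping.
    """
    return (
        value.replace("\\", "\\\\")
        .replace("\n", "\\n")
        .replace('"', '\\"')
    )


def render_info_metric(base_name: str, labels: dict[str, str]) -> str:
    """Render one info-style sample: name suffixed with _info, value fixed at 1.0."""
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{base_name}_info{{{rendered}}} 1.0"


# Hypothetical label values, chosen only to make the sketch runnable.
line = render_info_metric(
    "flink_jobmanager_job_lastCheckpointExternalPath",
    {"job_id": "abc123", "host": "jm-0", "path": "hdfs://namenode/checkpoints/chk-42"},
)
print(line)
```

Because the string lives in a label rather than in the sample value, the line stays valid for any numeric-only time-series backend.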
> h2. Example Output
> Before:
> flink_jobmanager_job_lastCheckpointExternalPath{job_id="...",host="..."} "hdfs://..."
> ❌ Not supported by Prometheus, so the value is converted to 0 and loses its
> real meaning:
> flink_jobmanager_job_lastCheckpointExternalPath{job_id="...",host="..."} 0
> After:
> flink_jobmanager_job_lastCheckpointExternalPath_info{job_id="...",host="...",path="hdfs://..."} 1.0
> ✅ Fully compatible with Prometheus
> h2. Use Cases
> # *Dashboard Visualization*: Join checkpoint path with other metrics
> {{flink_jobmanager_job_lastCheckpointSize}}
> {{ * on(job_id) group_left(path)}}
> {{ flink_jobmanager_job_lastCheckpointExternalPath_info}}
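> # *Alerting*: Detect checkpoint path changes. The sample value is always 1
> and a new path creates a new series, so count the distinct series per job
> within the window rather than value changes:
> {{count by (job_id) (count_over_time(flink_jobmanager_job_lastCheckpointExternalPath_info[5m])) > 1}}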
> # *Metadata Extraction*: Extract path for external systems via
> Prometheus API
> {{result['metric']['path'] # Get checkpoint path value}}
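
The metadata extraction use case can be sketched as follows, assuming the caller has already fetched the body of an instant query against the Prometheus HTTP API (/api/v1/query) for the _info metric. The function name is hypothetical and the sample response is invented for illustration; only the response shape (status, data.result[].metric) follows the Prometheus API.

```python
import json


def extract_checkpoint_paths(response_body: str) -> dict[str, str]:
    """Map job_id -> checkpoint path from an instant-query response body."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    paths = {}
    for result in payload["data"]["result"]:
        labels = result["metric"]
        # Skip series that lack either label instead of failing on them.
        if "job_id" in labels and "path" in labels:
            paths[labels["job_id"]] = labels["path"]
    return paths


# Invented response body; real label values depend on the deployment.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "flink_jobmanager_job_lastCheckpointExternalPath_info",
                    "job_id": "abc123",
                    "host": "jm-0",
                    "path": "hdfs://namenode/checkpoints/chk-42",
                },
                "value": [1700000000.0, "1"],
            }
        ],
    },
})
paths = extract_checkpoint_paths(sample)
print(paths)
```

The sample value itself is ignored here; only the labels carry information for an info-style metric.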
>