[ 
https://issues.apache.org/jira/browse/FLINK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-38584:
-----------------------------------
    Labels: pull-request-available  (was: )

> Support checkpoint external path as Prometheus info-style metric
> ----------------------------------------------------------------
>
>                 Key: FLINK-38584
>                 URL: https://issues.apache.org/jira/browse/FLINK-38584
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics
>    Affects Versions: 1.17.0, 1.13
>            Reporter: sohurdc
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Problem Statement
> Currently, the lastCheckpointExternalPath metric in Flink is exported as a 
> Gauge with the checkpoint path as its string value. This approach has several 
> limitations:
>  # Incompatible with Prometheus/VictoriaMetrics: Time-series databases such as 
> Prometheus and VictoriaMetrics store only numeric sample values, so the 
> checkpoint path cannot be stored without an additional storage system such 
> as InfluxDB.
>  # Limited Observability: Users cannot easily correlate checkpoint paths with 
> other checkpoint metrics (size, duration, etc.) in their monitoring 
> dashboards.
>  # Workaround Required: Currently, users need to set up separate storage 
> systems (e.g., InfluxDB) just to track checkpoint paths, increasing 
> operational complexity.
> h2. Proposed Solution
> Export lastCheckpointExternalPath as a Prometheus info-style metric:
>  * Metric name: {{lastCheckpointExternalPath_info}}
>  * Value: Always 1.0 (following Prometheus convention)
>  * Checkpoint path: Stored in a {{path}} label
>
> This approach:
>  * ✅ Compatible with Prometheus/VictoriaMetrics
>  * ✅ Follows Prometheus best practices for string-value metrics (similar to 
> node_uname_info)
>  * ✅ Enables joining with other metrics via PromQL
>  * ✅ No breaking changes to existing metrics
> h2. Example Output
> Before:
> flink_jobmanager_job_lastCheckpointExternalPath{job_id="...",host="..."} "hdfs://..."
> ❌ Not supported by Prometheus, so the sample is converted to:
> flink_jobmanager_job_lastCheckpointExternalPath{job_id="...",host="..."} 0
> The value 0 no longer carries the path, so the metric loses its real meaning.
> After:
> flink_jobmanager_job_lastCheckpointExternalPath_info{job_id="...",host="...",path="hdfs://..."} 1.0
> ✅ Fully compatible with Prometheus
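
To make the "After" line concrete, here is a minimal Python sketch of how such an info-style sample could be rendered in the Prometheus text exposition format. It is only an illustration, not the Flink reporter's implementation: the helper names `escape_label_value` and `render_info_metric` and the label values (`abc`, `jm-0`, the HDFS path) are hypothetical. The escaping rules (backslash, double quote, newline) are those of the text exposition format.

```python
def escape_label_value(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    # Backslash must be escaped first so later escapes are not doubled.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_info_metric(name: str, labels: dict[str, str]) -> str:
    """Render one info-style sample: labels carry the data, the value is 1.0."""
    label_text = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}{{{label_text}}} 1.0"


# Hypothetical label values, for illustration only.
line = render_info_metric(
    "flink_jobmanager_job_lastCheckpointExternalPath_info",
    {"job_id": "abc", "host": "jm-0", "path": "hdfs://namenode/checkpoints/chk-42"},
)
print(line)
```

The string value moves into the `path` label, so the sample value itself can stay numeric.
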
> h2. Use Cases
>  # *Dashboard Visualization*: Join checkpoint path with other metrics
> {{flink_jobmanager_job_lastCheckpointSize}}
> {{  * on(job_id) group_left(path)}}
> {{  flink_jobmanager_job_lastCheckpointExternalPath_info}}
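>  # *Alerting*: Detect checkpoint path changes. Each distinct path is a 
> separate series whose value stays at 1, so the query counts the series seen 
> per job within the window:
> {{count by (job_id) (count_over_time(flink_jobmanager_job_lastCheckpointExternalPath_info[5m])) > 1}}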
>  # *Metadata Extraction*: Extract path for external systems via the 
> Prometheus API
> {{result['metric']['path']  # Get checkpoint path value}}
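
The metadata-extraction use case could look roughly like the Python sketch below. It parses the JSON body returned by a Prometheus instant query (the `/api/v1/query` endpoint) and collects the `path` label per `job_id`. The function name `extract_paths` and the sample payload values are hypothetical. No HTTP request is made here, so the response text is assumed to have been fetched already.

```python
import json


def extract_paths(response_text: str) -> dict[str, str]:
    """Map job_id -> checkpoint path from an instant-query JSON response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    paths = {}
    for result in payload["data"]["result"]:
        metric = result["metric"]
        # Skip series that do not carry the path label.
        if "path" in metric:
            paths[metric.get("job_id", "")] = metric["path"]
    return paths


# Hypothetical response body, shaped like an instant-vector query result.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{
            "metric": {
                "__name__": "flink_jobmanager_job_lastCheckpointExternalPath_info",
                "job_id": "abc",
                "path": "hdfs://namenode/checkpoints/chk-42",
            },
            "value": [1700000000, "1"],
        }],
    },
})

paths = extract_paths(SAMPLE)
print(paths)
```
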
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
