sohurdc opened a new pull request, #27165:
URL: https://github.com/apache/flink/pull/27165
## What is the purpose of the change
This pull request enhances the Prometheus reporter to export the
`lastCheckpointExternalPath` metric as an info-style metric, making it
compatible with Prometheus and VictoriaMetrics.
**Current Problem:**
- The `lastCheckpointExternalPath` metric is currently exported as a
string-valued Gauge
- Prometheus and VictoriaMetrics only accept numeric sample values, so the
path string cannot be stored as a metric value
- Users must use additional storage systems (e.g., InfluxDB) to track
checkpoint paths, increasing operational complexity
**Solution:**
- Export `lastCheckpointExternalPath` as a Prometheus info-style metric with
`_info` suffix
- Store the checkpoint path in a `path` label instead of as a metric value
- Set the metric value to 1.0 (following Prometheus convention for info
metrics)
This approach follows Prometheus best practices (similar to
`node_uname_info` from node_exporter) and enables users to:
1. Store checkpoint paths directly in Prometheus/VictoriaMetrics
2. Join checkpoint paths with other checkpoint metrics via PromQL
3. Create monitoring dashboards and alerts based on checkpoint paths
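
For illustration, the info-style shape described above can be sketched without any Prometheus client dependency. This is a minimal sketch, not the code in this PR: the class `InfoMetricLineSketch` and its methods are hypothetical, and the sample `job_id` and path in `main` are made up. It renders one sample line by appending `_info` to the name, placing the path in a `path` label, and fixing the value at 1.0. The escaping follows the Prometheus text exposition format for label values (backslash, double quote, newline).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Flink or the Prometheus client. */
final class InfoMetricLineSketch {

    /** Escapes backslash, double quote and newline in a label value. */
    static String escapeLabelValue(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            switch (c) {
                case '\\':
                    sb.append("\\\\");
                    break;
                case '"':
                    sb.append("\\\"");
                    break;
                case '\n':
                    sb.append("\\n");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Renders one sample: <baseName>_info{<labels>,path="<path>"} 1.0 */
    static String render(String baseName, Map<String, String> labels, String path) {
        StringBuilder sb = new StringBuilder(baseName).append("_info{");
        for (Map.Entry<String, String> e : labels.entrySet()) {
            sb.append(e.getKey())
                    .append("=\"")
                    .append(escapeLabelValue(e.getValue()))
                    .append("\",");
        }
        // The string that used to be the gauge value becomes a label; the value is constant.
        sb.append("path=\"").append(escapeLabelValue(path)).append("\"} 1.0");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("job_id", "abc123"); // made-up job id
        System.out.println(
                render(
                        "flink_jobmanager_job_lastCheckpointExternalPath",
                        labels,
                        "hdfs://namenode/checkpoints/chk-42")); // made-up path
    }
}
```

Because the path lives in a label, every new checkpoint path produces a new time series, which is the usual trade-off of info-style metrics.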
## Brief change log
- Added `CHECKPOINT_PATH_METRIC_NAME` constant to identify the checkpoint
path metric
- Modified `createCollector()` method in `AbstractPrometheusReporter`
to detect and handle checkpoint path metrics specially
- Added `CheckpointPathInfoCollector` inner class to export checkpoint path
as an info-style metric
  - Appends `_info` suffix to the metric name
  - Stores checkpoint path in a `path` label
  - Sets metric value to 1.0
  - Handles null and empty path values gracefully
- Added comprehensive unit tests in `CheckpointPathInfoCollectorTest`
with 4 test cases
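
The detection and null/empty handling listed above might look roughly like the following. This is a simplified sketch under stated assumptions, not the PR's implementation: the value of `CHECKPOINT_PATH_METRIC_NAME`, the `Sample` type, and the use of `Supplier` in place of Flink's `Gauge` are all assumptions chosen so the example compiles without Flink or Prometheus on the classpath.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for the collector logic; names and types are assumptions. */
final class CheckpointPathSampleSketch {

    // Assumed value; the PR only states that such a constant exists.
    static final String CHECKPOINT_PATH_METRIC_NAME = "lastCheckpointExternalPath";

    /** One exported sample: metric name, value of the path label, numeric value. */
    static final class Sample {
        final String name;
        final String path;
        final double value;

        Sample(String name, String path, double value) {
            this.name = name;
            this.path = path;
            this.value = value;
        }
    }

    /** Decides whether a metric should take the info-style code path. */
    static boolean isCheckpointPathMetric(String metricName) {
        return CHECKPOINT_PATH_METRIC_NAME.equals(metricName);
    }

    /** Returns no samples for a null or empty path, otherwise a single info-style sample. */
    static List<Sample> collect(String scopedName, Supplier<?> gauge) {
        Object value = gauge.get();
        if (value == null) {
            return Collections.emptyList();
        }
        String path = value.toString();
        if (path.isEmpty()) {
            return Collections.emptyList();
        }
        return Collections.singletonList(new Sample(scopedName + "_info", path, 1.0));
    }
}
```

Returning an empty list rather than a sample with an empty `path` label keeps a placeholder series from appearing before the first checkpoint path is available.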
## Verifying this change
This change added tests and can be verified as follows:
**Unit Tests:**
- Added `CheckpointPathInfoCollectorTest` with 4 test cases:
  - `testCheckpointPathExportedAsInfoMetric`: Verifies checkpoint path is
  correctly exported as an info metric with path in label
  - `testNullCheckpointPathReturnsEmptyList`: Verifies null path values are
  handled correctly (returns empty list)
  - `testEmptyCheckpointPathReturnsEmptyList`: Verifies empty string path
  values are handled correctly
  - `testCheckpointPathWithSpecialCharacters`: Verifies special characters in
  paths (e.g., S3 URLs with query parameters) are preserved correctly
**Integration Verification:**
All Prometheus reporter tests pass (27/27 tests), including the 4 new ones:
- `PrometheusReporterTest`: 14 tests
- `PrometheusReporterTaskScopeTest`: 5 tests
- `PrometheusPushGatewayReporterTest`: 4 tests
- `CheckpointPathInfoCollectorTest`: 4 tests (new)
**Manual Verification:**
The change can be manually verified by:
1. Starting a Flink cluster with Prometheus reporter enabled
2. Running a job with checkpointing enabled
3. Querying the Prometheus metrics endpoint (e.g., `curl http://localhost:9249/metrics`)
4. Verifying the output contains:
```
flink_jobmanager_job_lastCheckpointExternalPath_info{job_id="...",job_name="...",path="hdfs://..."} 1.0
```
5. Using PromQL to join with other metrics:
```promql
flink_jobmanager_job_lastCheckpointSize
* on(job_id) group_left(path)
flink_jobmanager_job_lastCheckpointExternalPath_info
```
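
To show why the constant value 1.0 matters for this join, here is a small Java simulation of what the query computes. It is only a sketch of the matching semantics, with hypothetical names (`GroupLeftJoinSketch`, `Series`), and it simplifies real PromQL: it assumes at most one info series per `job_id` and keeps only the labels, not the metric name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical simulation of `left * on(job_id) group_left(path) right`. */
final class GroupLeftJoinSketch {

    /** A single series at one instant: its labels and its sample value. */
    static final class Series {
        final Map<String, String> labels;
        final double value;

        Series(Map<String, String> labels, double value) {
            this.labels = labels;
            this.value = value;
        }
    }

    static List<Series> joinOnJobId(List<Series> left, List<Series> right) {
        // Index the info series by job_id (assumes one info series per job).
        Map<String, Series> rightByJob = new HashMap<>();
        for (Series r : right) {
            rightByJob.put(r.labels.get("job_id"), r);
        }
        List<Series> out = new ArrayList<>();
        for (Series l : left) {
            Series r = rightByJob.get(l.labels.get("job_id"));
            if (r == null) {
                continue; // left series without a matching info series are dropped
            }
            Map<String, String> labels = new HashMap<>(l.labels);
            labels.put("path", r.labels.get("path")); // group_left(path) copies this label
            out.add(new Series(labels, l.value * r.value)); // r.value is 1.0, so l.value survives
        }
        return out;
    }
}
```

Since the info series always has the value 1.0, the multiplication leaves the checkpoint size unchanged and only attaches the `path` label to it.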
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): **no**
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: **no**
- The serializers: **no**
- The runtime per-record code paths (performance sensitive): **no**
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: **no** (only affects
metric reporting, not checkpoint functionality)
- The S3 file system connector: **no**
## Documentation
- Does this pull request introduce a new feature? **yes**
- If yes, how is the feature documented? **JavaDocs**
**Documentation Details:**
- Comprehensive JavaDoc added to `CheckpointPathInfoCollector` class
explaining:
  - Purpose: Export checkpoint path as Prometheus info-style metric
  - Behavior: Path stored in label, value always 1.0
  - Example output format
- Inline code comments explaining the special handling logic
- Unit test documentation demonstrating usage patterns
**Additional Documentation (if requested):**
If the community requests it, I can add documentation to
`docs/content/docs/deployment/metric_reporters.md` explaining:
- The info-style metric format for checkpoint paths
- PromQL query examples for joining with other metrics
- Use cases for monitoring and alerting
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]