rkhachatryan commented on code in PR #24435:
URL: https://github.com/apache/flink/pull/24435#discussion_r1512875008
##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -632,4 +669,18 @@ public Long getValue() {
}
}
}
+
+ private class LatestCompletedCheckpointTimeSinceMillisGauge implements
Gauge<Long> {
+ @Override
+ public Long getValue() {
+ CompletedCheckpointStats completed = latestCompletedCheckpoint;
+ final long startTime;
+ if (completed != null) {
+ startTime = completed.getCompletedTimestamp();
Review Comment:
As discussed offline, ideally we might want some different logic if there
are no resources available to run the job, e.g.:
```
Failed to trigger checkpoint for job xxx since Checkpoint triggering task
yyy is not being executed at the moment.
Aborting checkpoint. Failure reason: Not all required tasks are currently
running..
```
I'm not sure how to implement this though.
##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -137,6 +163,10 @@ public CheckpointStatsTracker(
history.createSnapshot(),
null);
+ this.clock = Preconditions.checkNotNull(clock);
+ // Init to measure the time since the job started as metric for the
first checkpoint
+ this.checkpointTrackingStartTimestamp = clock.absoluteTimeMillis();
Review Comment:
As discussed offline, ideally we should preserve the metric in case of JM
failover.
For that, we need to:
1. set `checkpointTrackingStartTimestamp` to `JobGraph` creation timestamp
(which can be added to JG).
2. in `reportRestoredCheckpoint`, set it to
`max(checkpointTrackingStartTimestamp, checkpoint.timestamp)`
(`checkpointTrackingStartTimestamp` can be renamed or another field added
...)
##########
docs/content.zh/docs/ops/metrics.md:
##########
@@ -1312,6 +1312,11 @@ Note that for failed checkpoints, metrics are updated on
a best efforts basis an
<td>The identifier of the last completed checkpoint.</td>
<td>Gauge</td>
</tr>
+ <tr>
+ <td>lastCompletedCheckpointTimeSinceMillis</td>
+ <td>The time that has passed since the last completed checkpoint in
milliseconds. If no checkpoint has been completed, this measures the time since
the start of the job. After restore and before another checkpoint completes,
this measures the time from the completion timestamp of the restored
checkpoint.</td>
Review Comment:
> If no checkpoint has been completed, this measures the time since the
start of the job.
Maybe clarify that in case of JM failover, this will be the time since the
start of the job **master**?
(ditto: english version)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]