Re: [PR] [FLINK-34579][metrics] Introduce metric for time since last competed checkpoint [flink]

via GitHub Tue, 05 Mar 2024 06:05:37 -0800


rkhachatryan commented on code in PR #24435:
URL: https://github.com/apache/flink/pull/24435#discussion_r1512875008



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -632,4 +669,18 @@ public Long getValue() {
             }
         }
     }
+
+    private class LatestCompletedCheckpointTimeSinceMillisGauge implements 
Gauge<Long> {
+        @Override
+        public Long getValue() {
+            CompletedCheckpointStats completed = latestCompletedCheckpoint;
+            final long startTime;
+            if (completed != null) {
+                startTime = completed.getCompletedTimestamp();

Review Comment:
   As discussed offline, ideally we might want some different logic if there 
are no resources available to run the job, e.g.:
   ```
   Failed to trigger checkpoint for job xxx since Checkpoint triggering task 
yyy is not being executed at the moment. 
   Aborting checkpoint. Failure reason: Not all required tasks are currently 
running..
   ```
   
   I'm not sure how to implement this though.



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -137,6 +163,10 @@ public CheckpointStatsTracker(
                         history.createSnapshot(),
                         null);
 
+        this.clock = Preconditions.checkNotNull(clock);
+        // Init to measure the time since the job started as metric for the 
first checkpoint
+        this.checkpointTrackingStartTimestamp = clock.absoluteTimeMillis();

Review Comment:
   As discussed offline, ideally we should preserve the metric in case of JM 
failover.
   
   For that, we need to:
   1. set `checkpointTrackingStartTimestamp` to `JobGraph` creation timestamp 
(which can be added to JG).
   2. in `reportRestoredCheckpoint`, set it to 
`max(checkpointTrackingStartTimestamp, checkpoint.timestamp)`
   
   (`checkpointTrackingStartTimestamp` can be renamed or another field added 
...)



##########
docs/content.zh/docs/ops/metrics.md:
##########
@@ -1312,6 +1312,11 @@ Note that for failed checkpoints, metrics are updated on 
a best efforts basis an
       <td>The identifier of the last completed checkpoint.</td>
       <td>Gauge</td>
     </tr>
+    <tr>
+      <td>lastCompletedCheckpointTimeSinceMillis</td>
+      <td>The time that has passed since the last completed checkpoint in 
milliseconds. If no checkpoint has been completed, this measures the time since 
the start of the job. After restore and before another checkpoint completes, 
this measures the time from the completion timestamp of the restored 
checkpoint.</td>

Review Comment:
   > If no checkpoint has been completed, this measures the time since the 
start of the job.
   
   Maybe clarify that in case of JM failover, this will be the time since the 
start of the job **master**?
   
   (ditto: english version)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [FLINK-34579][metrics] Introduce metric for time since last competed checkpoint [flink]

Reply via email to