StefanRRichter commented on code in PR #23908:
URL: https://github.com/apache/flink/pull/23908#discussion_r1440576635


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -155,7 +167,12 @@ public CheckpointStatsSnapshot createSnapshot() {
                                 counts.createSnapshot(),
                                 summary.createSnapshot(),
                                 history.createSnapshot(),
-                                latestRestoredCheckpoint);
+                                jobInitializationMetricsBuilder
+                                        .map(
+                                                JobInitializationMetricsBuilder
+                                                        ::buildRestoredCheckpointStats)
+                                        .orElse(Optional.empty())
+                                        .orElse(null));

Review Comment:
   Why 2x orElse?
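For reference: if buildRestoredCheckpointStats itself returns an Optional (which the Optional.empty() fallback suggests), map produces a nested Optional<Optional<...>>, and Optional.flatMap would make the first orElse unnecessary. A minimal sketch with hypothetical stand-in types (RestoredStats, MetricsBuilder and FlatMapExample are illustrations, not Flink classes):

```java
import java.util.Optional;

// Hypothetical stand-in for RestoredCheckpointStats.
final class RestoredStats {
    final long checkpointId;

    RestoredStats(long checkpointId) {
        this.checkpointId = checkpointId;
    }
}

// Hypothetical stand-in for JobInitializationMetricsBuilder.
final class MetricsBuilder {
    private final RestoredStats restored; // null when nothing was restored

    MetricsBuilder(RestoredStats restored) {
        this.restored = restored;
    }

    // Assumed to return Optional, as the Optional.empty() fallback in the diff implies.
    Optional<RestoredStats> buildRestoredCheckpointStats() {
        return Optional.ofNullable(restored);
    }
}

final class FlatMapExample {
    // map(...) would give Optional<Optional<RestoredStats>>; flatMap keeps it one level deep.
    static RestoredStats latestRestored(Optional<MetricsBuilder> builder) {
        return builder.flatMap(MetricsBuilder::buildRestoredCheckpointStats).orElse(null);
    }
}
```

Both the "no builder" and the "builder without a restored checkpoint" cases end up as null, which is what the double orElse computes.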



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -86,9 +88,10 @@ public class CheckpointStatsTracker {
 
     private final JobID jobID;
     private final MetricGroup metricGroup;
+    private int totalNumberOfSubTasks;
 
-    /** The latest restored checkpoint. */
-    @Nullable private RestoredCheckpointStats latestRestoredCheckpoint;
+    private Optional<JobInitializationMetricsBuilder> jobInitializationMetricsBuilder =

Review Comment:
   Why this? I think Optional isn't even intended to be used for fields.
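A sketch of the alternative the comment hints at: keep a plain nullable field, as the removed latestRestoredCheckpoint field did, and expose Optional only from a getter. InitBuilder and StatsTrackerSketch are hypothetical stand-ins, and the javax.annotation.Nullable annotation is left out so the snippet compiles without extra dependencies:

```java
import java.util.Optional;

// Hypothetical stand-in for JobInitializationMetricsBuilder.
final class InitBuilder {
    final int totalNumberOfSubTasks;
    final long startTs;

    InitBuilder(int totalNumberOfSubTasks, long startTs) {
        this.totalNumberOfSubTasks = totalNumberOfSubTasks;
        this.startTs = startTs;
    }
}

final class StatsTrackerSketch {
    private final int totalNumberOfSubTasks;

    // Plain field, null until initialization starts. Flink code would mark it @Nullable
    // (javax.annotation), omitted here to avoid the dependency.
    private InitBuilder jobInitializationMetricsBuilder;

    StatsTrackerSketch(int totalNumberOfSubTasks) {
        this.totalNumberOfSubTasks = totalNumberOfSubTasks;
    }

    void reportInitializationStartTs(long initializationStartTs) {
        jobInitializationMetricsBuilder =
                new InitBuilder(totalNumberOfSubTasks, initializationStartTs);
    }

    // Optional only at the API boundary, as a return type.
    Optional<InitBuilder> currentBuilder() {
        return Optional.ofNullable(jobInitializationMetricsBuilder);
    }
}
```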



##########
flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java:
##########
@@ -730,57 +739,87 @@ void restoreInternal() throws Exception {
         getEnvironment().getMetricGroup().getIOMetricGroup().markTaskInitializationStarted();
         LOG.debug("Initializing {}.", getName());
 
-        operatorChain =
-                getEnvironment().getTaskStateManager().isTaskDeployedAsFinished()
-                        ? new FinishedOperatorChain<>(this, recordWriter)
-                        : new RegularOperatorChain<>(this, recordWriter);
-        mainOperator = operatorChain.getMainOperator();
+        SubTaskInitializationMetricsBuilder initializationMetrics =
+                new SubTaskInitializationMetricsBuilder(
+                        SystemClock.getInstance().absoluteTimeMillis());
+        try {
+            operatorChain =
+                    getEnvironment().getTaskStateManager().isTaskDeployedAsFinished()
+                            ? new FinishedOperatorChain<>(this, recordWriter)
+                            : new RegularOperatorChain<>(this, recordWriter);
+            mainOperator = operatorChain.getMainOperator();
 
-        getEnvironment()
-                .getTaskStateManager()
-                .getRestoreCheckpointId()
-                .ifPresent(restoreId -> latestReportCheckpointId = restoreId);
+            getEnvironment()

Review Comment:
   Revert formatting changes in this file?



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/JobInitializationMetricsBuilder.java:
##########
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.checkpoint;
+
+import org.apache.flink.runtime.checkpoint.JobInitializationMetrics.SumMaxDuration;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+
+import static org.apache.flink.runtime.checkpoint.JobInitializationMetrics.UNSET;
+import static org.apache.flink.util.Preconditions.checkArgument;
+import static org.apache.flink.util.Preconditions.checkState;
+
+class JobInitializationMetricsBuilder {
+    private static final Logger LOG =
+            LoggerFactory.getLogger(JobInitializationMetricsBuilder.class);
+
+    private final List<SubTaskInitializationMetrics> reportedMetrics = new ArrayList<>();
+    private final int totalNumberOfSubTasks;
+    private final long startTs;
+    private Optional<Long> stateSize = Optional.empty();

Review Comment:
   nit: Optional abused for field.
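One way to address the nit, assuming the builder can reuse the UNSET sentinel it already imports from JobInitializationMetrics: store the size in a primitive long. The value -1 for UNSET and the summing addStateSize method are assumptions for illustration only:

```java
final class StateSizeSketch {
    // Assumption: -1 stands in for JobInitializationMetrics.UNSET, whose real value may differ.
    static final long UNSET = -1L;

    // Primitive field with a sentinel instead of Optional<Long>: no Optional field, no boxing.
    private long stateSize = UNSET;

    // Hypothetical: accumulates the state size reported by each subtask.
    void addStateSize(long subTaskStateSize) {
        stateSize = (stateSize == UNSET ? 0L : stateSize) + subTaskStateSize;
    }

    long getStateSize() {
        return stateSize;
    }
}
```

This also matches how traceInitializationMetrics already checks getStateSize() against UNSET.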



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -343,6 +376,62 @@ public void reportIncompleteStats(
         }
     }
 
+    public void reportInitializationStartTs(long initializationStartTs) {
+        jobInitializationMetricsBuilder =
+                Optional.of(
+                        new JobInitializationMetricsBuilder(
+                                totalNumberOfSubTasks, initializationStartTs));
+    }
+
+    public void reportInitializationMetrics(SubTaskInitializationMetrics initializationMetrics) {
+        statsReadWriteLock.lock();
+        try {
+            if (!jobInitializationMetricsBuilder.isPresent()) {
+                LOG.warn(
+                        "Attempted to report SubTaskInitializationMetrics [{}] without jobInitializationMetricsBuilder present",
+                        initializationMetrics);
+                return;
+            }
+            jobInitializationMetricsBuilder
+                    .get()
+                    .reportInitializationMetrics(initializationMetrics);
+            if (jobInitializationMetricsBuilder.get().isComplete()) {
+                traceInitializationMetrics(jobInitializationMetricsBuilder.get().build());
+            }
+        } catch (Exception ex) {
+            LOG.warn("Fail to log SubTaskInitializationMetrics[{}]", ex, initializationMetrics);
+        } finally {
+            statsReadWriteLock.unlock();
+        }
+    }
+
+    private void traceInitializationMetrics(JobInitializationMetrics jobInitializationMetrics) {
+        SpanBuilder span =
+                Span.builder(CheckpointStatsTracker.class, "JobInitialization")
+                        .setStartTsMillis(jobInitializationMetrics.getStartTs())
+                        .setEndTsMillis(jobInitializationMetrics.getEndTs())
+                        .setAttribute(
+                                "initializationStatus",
+                                jobInitializationMetrics.getStatus().name());
+        for (JobInitializationMetrics.SumMaxDuration duration :
+                jobInitializationMetrics.getDurationMetrics().values()) {
+            setDurationSpanAttribute(span, duration);
+        }
+        if (jobInitializationMetrics.getCheckpointId() != JobInitializationMetrics.UNSET) {
+            span.setAttribute("checkpointId", jobInitializationMetrics.getCheckpointId());
+        }
+        if (jobInitializationMetrics.getStateSize() != JobInitializationMetrics.UNSET) {
+            span.setAttribute("fullSize", jobInitializationMetrics.getStateSize());
+        }
+        metricGroup.addSpan(span);
+    }
+
+    private void setDurationSpanAttribute(
+            SpanBuilder span, JobInitializationMetrics.SumMaxDuration duration) {
+        span.setAttribute("max" + duration.getName(), duration.getMax());
+        span.setAttribute("sum" + duration.getName(), duration.getSum());

Review Comment:
   Why max and sum? Arguably, max is the most important one; is sum even meaningful to us? And if so, why not also report avg or the unaggregated values?
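To make the trade-off concrete: max is the slowest subtask, which roughly determines how long job initialization takes (assuming subtasks start at about the same time); sum approximates the total work across subtasks; avg is just sum divided by the number of reports, so it costs nothing extra. A minimal sketch of an aggregate exposing all of them plus the raw values; DurationAggregate is hypothetical and not the PR's SumMaxDuration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical aggregate of one duration metric across subtasks (not the PR's SumMaxDuration).
final class DurationAggregate {
    private final String name;
    private final List<Long> values = new ArrayList<>(); // raw, unaggregated values
    private long sum;
    private long max; // durations are non-negative, so 0 is a safe starting point

    DurationAggregate(String name) {
        this.name = name;
    }

    void add(long durationMillis) {
        values.add(durationMillis);
        sum += durationMillis;
        max = Math.max(max, durationMillis);
    }

    String getName() { return name; }

    // Slowest subtask.
    long getMax() { return max; }

    // Total time spent across all subtasks.
    long getSum() { return sum; }

    // Mean per reporting subtask; 0 if nothing was reported yet.
    double getAvg() { return values.isEmpty() ? 0.0 : (double) sum / values.size(); }

    List<Long> getValues() { return values; }
}
```

Keeping the raw values costs memory proportional to the parallelism, which may be a reason to report only aggregates.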



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -343,6 +376,62 @@ public void reportIncompleteStats(
         }
     }
 
+    public void reportInitializationStartTs(long initializationStartTs) {
+        jobInitializationMetricsBuilder =
+                Optional.of(
+                        new JobInitializationMetricsBuilder(
+                                totalNumberOfSubTasks, initializationStartTs));
+    }
+
+    public void reportInitializationMetrics(SubTaskInitializationMetrics initializationMetrics) {
+        statsReadWriteLock.lock();
+        try {
+            if (!jobInitializationMetricsBuilder.isPresent()) {
+                LOG.warn(
+                        "Attempted to report SubTaskInitializationMetrics [{}] without jobInitializationMetricsBuilder present",
+                        initializationMetrics);
+                return;
+            }
+            jobInitializationMetricsBuilder
+                    .get()

Review Comment:
   nit: call get() once and keep the result in a local variable for further use?
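A sketch of that nit, keeping the PR's Optional field as-is and unwrapping it once; CountingBuilder and LocalVariableSketch are hypothetical stand-ins, and the lock and tracing are omitted:

```java
import java.util.Optional;

// Hypothetical stand-in for JobInitializationMetricsBuilder.
final class CountingBuilder {
    private final int expectedReports;
    private int reports;

    CountingBuilder(int expectedReports) {
        this.expectedReports = expectedReports;
    }

    void reportInitializationMetrics(String metrics) {
        reports++;
    }

    boolean isComplete() {
        return reports == expectedReports;
    }
}

final class LocalVariableSketch {
    // Optional field kept only to mirror the PR.
    private Optional<CountingBuilder> jobInitializationMetricsBuilder = Optional.empty();
    int tracedCount; // how often a complete builder would have been traced

    void start(int totalNumberOfSubTasks) {
        jobInitializationMetricsBuilder = Optional.of(new CountingBuilder(totalNumberOfSubTasks));
    }

    void reportInitializationMetrics(String metrics) {
        if (!jobInitializationMetricsBuilder.isPresent()) {
            return;
        }
        // Unwrap once; the rest of the method uses the plain reference.
        CountingBuilder builder = jobInitializationMetricsBuilder.get();
        builder.reportInitializationMetrics(metrics);
        if (builder.isComplete()) {
            tracedCount++; // the PR calls traceInitializationMetrics(...build()) here
        }
    }
}
```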



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointStatsTracker.java:
##########
@@ -343,6 +376,62 @@ public void reportIncompleteStats(
         }
     }
 
+    public void reportInitializationStartTs(long initializationStartTs) {
+        jobInitializationMetricsBuilder =
+                Optional.of(
+                        new JobInitializationMetricsBuilder(
+                                totalNumberOfSubTasks, initializationStartTs));
+    }
+
+    public void reportInitializationMetrics(SubTaskInitializationMetrics initializationMetrics) {
+        statsReadWriteLock.lock();
+        try {
+            if (!jobInitializationMetricsBuilder.isPresent()) {
+                LOG.warn(
+                        "Attempted to report SubTaskInitializationMetrics [{}] without jobInitializationMetricsBuilder present",
+                        initializationMetrics);
+                return;
+            }
+            jobInitializationMetricsBuilder
+                    .get()
+                    .reportInitializationMetrics(initializationMetrics);
+            if (jobInitializationMetricsBuilder.get().isComplete()) {
+                traceInitializationMetrics(jobInitializationMetricsBuilder.get().build());
+            }
+        } catch (Exception ex) {
+            LOG.warn("Fail to log SubTaskInitializationMetrics[{}]", ex, initializationMetrics);

Review Comment:
   When that happens, would it make sense to clear the optional to avoid further reporting work? We will never get to completion for this checkpoint anymore.
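A sketch of clearing the builder once reporting fails, so later reports return early. Separately, with SLF4J the throwable must be the last argument to be logged as an exception; in the call above, ex fills the {} placeholder and initializationMetrics is dropped. The types below are hypothetical stand-ins, the lock is omitted, and warnings go to a list instead of a logger:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in whose reporting may throw.
final class FailingBuilder {
    int calls;

    void reportInitializationMetrics(String metrics) {
        calls++;
        if (metrics.isEmpty()) {
            throw new IllegalArgumentException("empty metrics");
        }
    }
}

final class ClearOnFailureSketch {
    // null means: no initialization tracked, or tracking was abandoned after a failure.
    private FailingBuilder jobInitializationMetricsBuilder;
    final List<String> warnings = new ArrayList<>();

    void start(FailingBuilder builder) {
        jobInitializationMetricsBuilder = builder;
    }

    boolean isTracking() {
        return jobInitializationMetricsBuilder != null;
    }

    void reportInitializationMetrics(String metrics) {
        FailingBuilder builder = jobInitializationMetricsBuilder;
        if (builder == null) {
            return; // nothing to do once tracking was abandoned
        }
        try {
            builder.reportInitializationMetrics(metrics);
        } catch (Exception ex) {
            // SLF4J equivalent: LOG.warn("Failed to report SubTaskInitializationMetrics [{}]", metrics, ex);
            warnings.add("Failed to report [" + metrics + "]: " + ex.getMessage());
            // Completion is no longer reachable, so skip all further reporting work.
            jobInitializationMetricsBuilder = null;
        }
    }
}
```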



##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/SubTaskInitializationMetricsBuilder.java:
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.checkpoint;
+
+import org.apache.flink.annotation.VisibleForTesting;
+
+import javax.annotation.concurrent.NotThreadSafe;
+
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * A builder for {@link SubTaskInitializationMetrics}.
+ *
+ * <p>This class is not thread safe, but parts of it can actually be used from different threads.

Review Comment:
   Consider adding a usage example.
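A sketch of what such a Javadoc example could look like. Only the constructor taking the initialization start timestamp appears in this diff (see the StreamTask hunk); the class name InitMetricsBuilderSketch and the methods addDurationMetric and build are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for SubTaskInitializationMetricsBuilder, showing the kind of usage
 * example its class Javadoc could carry.
 *
 * <pre>{@code
 * InitMetricsBuilderSketch builder = new InitMetricsBuilderSketch(startTs);
 * long restoreStart = System.currentTimeMillis();
 * // ... restore state ...
 * builder.addDurationMetric("restoreDurationMs", System.currentTimeMillis() - restoreStart);
 * Map<String, Long> metrics = builder.build(System.currentTimeMillis());
 * }</pre>
 */
final class InitMetricsBuilderSketch {
    private final long initializationStartTs;
    private final Map<String, Long> durationMetrics = new HashMap<>();

    InitMetricsBuilderSketch(long initializationStartTs) {
        this.initializationStartTs = initializationStartTs;
    }

    // Hypothetical: repeated reports of the same metric are summed.
    void addDurationMetric(String name, long durationMillis) {
        durationMetrics.merge(name, durationMillis, Long::sum);
    }

    // Hypothetical: returns the durations plus the total initialization time.
    Map<String, Long> build(long initializationEndTs) {
        Map<String, Long> result = new HashMap<>(durationMetrics);
        result.put("initializationDurationMs", initializationEndTs - initializationStartTs);
        return result;
    }
}
```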


