Re: [PR] Enable segments read/published stats on Compaction task completion reports (druid)

via GitHub Tue, 05 Mar 2024 02:29:24 -0800


kfaraz commented on code in PR #15947:
URL: https://github.com/apache/druid/pull/15947#discussion_r1512535444



##########
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/ParallelIndexSupervisorTask.java:
##########
@@ -651,6 +654,15 @@ private TaskStatus runSinglePhaseParallel(TaskToolbox 
toolbox) throws Exception
     if (state.isSuccess()) {
       //noinspection ConstantConditions
       publishSegments(toolbox, parallelSinglePhaseRunner.getReports());
+      segmentsRead = parallelSinglePhaseRunner.getReports()
+                                              .values()
+                                              .stream()
+                                              .mapToLong(report -> 
report.getOldSegments().size()).sum();
+      segmentsPublished = parallelSinglePhaseRunner.getReports()

Review Comment:
   Why is `segmentsPublished` not directly assigned the result of 
`publishSegments()` like in the other places?



##########
docs/ingestion/tasks.md:
##########
@@ -155,6 +155,12 @@ For some task types, the indexing task can wait for the 
newly ingested segments
 |`segmentAvailabilityWaitTimeMs`|Milliseconds waited by the ingestion task for 
the newly ingested segments to be available for query after completing 
ingestion was completed.|
 |`recordsProcessed`| Partitions that were processed by an ingestion task and 
includes count of records processed from each partition.|
 
+
+|Field| Description |
+|---|---|
+| `segmentsRead`| Number of segments read by compaction task with more than 1 
subtask.|
+| `segmentsPublished`| Number of segments published by compaction task with 
more than 1 subtask.|

Review Comment:
   Nit: The Druid docs generally avoid the extra spaces in tables:
   
   ```suggestion
   |Field|Description|
   |---|---|
   |`segmentsRead`|Number of segments read by compaction task with more than 1 
subtask.|
   |`segmentsPublished`|Number of segments published by compaction task with 
more than 1 subtask.|
   ```



##########
indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/GeneratedPartitionsReport.java:
##########
@@ -37,11 +38,19 @@ public class GeneratedPartitionsReport implements 
SubTaskReport
   private final List<PartitionStat> partitionStats;
   private final Map<String, TaskReport> taskReport;
 
-  GeneratedPartitionsReport(String taskId, List<PartitionStat> partitionStats, 
Map<String, TaskReport> taskReport)
+  private final Long segmentsRead;
+
+  GeneratedPartitionsReport(
+      String taskId,
+      List<PartitionStat> partitionStats,
+      Map<String, TaskReport> taskReport,
+      Long segmentsRead

Review Comment:
   It is weird to have this as a top-level field in all of these classes when 
we already have a well-structured `TaskReport`. Ideally, this information 
should be present inside the `taskReport` map passed to this class and then 
later be extracted and processed in `ParallelIndexSupervisorTask` as needed.



##########
indexing-service/src/main/java/org/apache/druid/indexing/common/ParallelCompactionTaskReportData.java:
##########
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.common;
+
+import com.fasterxml.jackson.annotation.JsonProperty;
+import org.apache.druid.indexer.IngestionState;
+
+import javax.annotation.Nullable;
+import java.util.Map;
+import java.util.Objects;
+
+/**
+ * This class has fields specific to the parallel compaction task and hence 
extends the generic
+ * task report data {@link IngestionStatsAndErrorsTaskReportData}
+ */
+public class ParallelCompactionTaskReportData extends 
IngestionStatsAndErrorsTaskReportData

Review Comment:
   Why do we need a new class for just 2 new fields?
   `segmentsPublished` and `segmentsRead` is information that can be useful in 
the case of non-compaction jobs too. We can use the existing 
`IngestionStatsAndErrorsTaskReportData` class and add these two as nullable 
fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Enable segments read/published stats on Compaction task completion reports (druid)

Reply via email to