Re: [PR] Populate segment stats for non-parallel compaction jobs (druid)

via GitHub Wed, 27 Mar 2024 23:23:20 -0700


kfaraz commented on code in PR #16171:
URL: https://github.com/apache/druid/pull/16171#discussion_r1542350627



##########
indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java:
##########
@@ -620,4 +622,20 @@ public static List<TimelineObjectHolder<String, 
DataSegment>> getTimelineForSegm
 
     return new ArrayList<>(timeline.values());
   }
+
+  public int estimateSegmentsCount()

Review Comment:
   Why estimate? Wouldn't this be the actual number?
   Please add a javadoc too.



##########
indexing-service/src/test/java/org/apache/druid/indexing/common/task/IngestionTestBase.java:
##########
@@ -114,6 +119,7 @@ public abstract class IngestionTestBase extends 
InitializedNullHandlingTest
   private SegmentsMetadataManager segmentsMetadataManager;
   private TaskLockbox lockbox;
   private File baseDir;
+  protected File reportsFile;

Review Comment:
   We don't seem to be writing to this file anywhere.



##########
indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java:
##########
@@ -620,4 +622,20 @@ public static List<TimelineObjectHolder<String, 
DataSegment>> getTimelineForSegm
 
     return new ArrayList<>(timeline.values());
   }
+
+  public int estimateSegmentsCount()
+  {
+    if (segmentIds != null) {
+      return segmentIds.size();
+    }
+
+    Set<SegmentId> ids = new HashSet<>();
+    for (TimelineObjectHolder<String, DataSegment> holder : createTimeline()) {

Review Comment:
   `createTimeline()` is a costly operation. Ideally, when `createTimeline()` 
is being called in the normal flow, we should just compute the number of 
segments and save it in a field and return than whenever required rather than 
freshly building the timeline every time we need the reports.



##########
indexing-service/src/main/java/org/apache/druid/indexing/common/task/IndexTask.java:
##########
@@ -1071,7 +1077,14 @@ private TaskStatus generateAndPublishSegments(
 
         log.debugSegments(published.getSegments(), "Published segments");
 
-        updateAndWriteCompletionReports(toolbox);
+        updateAndWriteCompletionReports(
+            toolbox,
+            // only applicable to the compaction use cases
+            inputSource instanceof DruidInputSource
+            ? (long) ((DruidInputSource) inputSource).estimateSegmentsCount()

Review Comment:
   @adithyachakilam , do you recall why we had kept these fields as long? Since 
they are the number of segments read and published, I don't think they will 
ever be bigger than an int (2 billion). Should we just make these fields as 
Integer?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org

Re: [PR] Populate segment stats for non-parallel compaction jobs (druid)

Reply via email to