kfaraz commented on code in PR #16171: URL: https://github.com/apache/druid/pull/16171#discussion_r1542350627
########## indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java: ########## @@ -620,4 +622,20 @@ public static List<TimelineObjectHolder<String, DataSegment>> getTimelineForSegm return new ArrayList<>(timeline.values()); } + + public int estimateSegmentsCount() Review Comment: Why estimate? Wouldn't this be the actual number? Please add a javadoc too. ########## indexing-service/src/test/java/org/apache/druid/indexing/common/task/IngestionTestBase.java: ########## @@ -114,6 +119,7 @@ public abstract class IngestionTestBase extends InitializedNullHandlingTest private SegmentsMetadataManager segmentsMetadataManager; private TaskLockbox lockbox; private File baseDir; + protected File reportsFile; Review Comment: We don't seem to be writing to this file anywhere. ########## indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java: ########## @@ -620,4 +622,20 @@ public static List<TimelineObjectHolder<String, DataSegment>> getTimelineForSegm return new ArrayList<>(timeline.values()); } + + public int estimateSegmentsCount() + { + if (segmentIds != null) { + return segmentIds.size(); + } + + Set<SegmentId> ids = new HashSet<>(); + for (TimelineObjectHolder<String, DataSegment> holder : createTimeline()) { Review Comment: `createTimeline()` is a costly operation. Ideally, when `createTimeline()` is being called in the normal flow, we should just compute the number of segments and save it in a field and return than whenever required rather than freshly building the timeline every time we need the reports. ########## indexing-service/src/main/java/org/apache/druid/indexing/common/task/IndexTask.java: ########## @@ -1071,7 +1077,14 @@ private TaskStatus generateAndPublishSegments( log.debugSegments(published.getSegments(), "Published segments"); - updateAndWriteCompletionReports(toolbox); + updateAndWriteCompletionReports( + toolbox, + // only applicable to the compaction use cases + inputSource instanceof DruidInputSource + ? (long) ((DruidInputSource) inputSource).estimateSegmentsCount() Review Comment: @adithyachakilam , do you recall why we had kept these fields as long? Since they are the number of segments read and published, I don't think they will ever be bigger than an int (2 billion). Should we just make these fields as Integer? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org For additional commands, e-mail: commits-h...@druid.apache.org