Re: [PR] Add numRows to DataSegment (druid)

via GitHub Mon, 12 Jan 2026 17:58:50 -0800


clintropolis commented on code in PR #18892:
URL: https://github.com/apache/druid/pull/18892#discussion_r2684533007



##########
server/src/main/java/org/apache/druid/segment/realtime/appenderator/BatchAppenderator.java:
##########
@@ -835,6 +836,14 @@ private DataSegmentWithMetadata mergeAndPush(
         metrics.setMergeTime(mergeTimeMillis);
 
         log.debug("Segment[%s] built in %,dms.", identifier, mergeTimeMillis);
+        QueryableIndex index = indexIO.loadIndex(mergedFile);
+        closer.register(index);

Review Comment:
   can this just use `sink.getNumRows()` or is that not cool for some reason?



##########
processing/src/main/java/org/apache/druid/query/metadata/SegmentMetadataQueryRunnerFactory.java:
##########
@@ -67,6 +68,59 @@ public SegmentMetadataQueryRunnerFactory(
     this.queryWatcher = queryWatcher;
   }
 
+  public QueryRunner<SegmentAnalysis> createRunner(SegmentReference segmentRef)
+  {
+    if (segmentRef.getDataSegment() == null) {
+      throw DruidException.defensive("Missing Segment[%s]", 
segmentRef.getSegmentDescriptor());
+    } else if (segmentRef.getSegmentReference().isEmpty()) {
+      throw DruidException.defensive("Missing Segment[%s]", 
segmentRef.getDataSegment().getId());
+    }
+
+    Segment segment = segmentRef.getSegmentReference().get();
+    if (!Objects.equals(segmentRef.getDataSegment().getId(), segment.getId())) 
{
+      throw DruidException.defensive(
+          "SegmentId mismatch in DataSegment[%s] and Segment[%s]",
+          segmentRef.getDataSegment().getId(),
+          segment.getId()
+      );
+    }
+
+    return new QueryRunner<>()
+    {
+      @Override
+      public Sequence<SegmentAnalysis> run(QueryPlus<SegmentAnalysis> inQ, 
ResponseContext responseContext)
+      {
+        SegmentMetadataQuery updatedQuery = ((SegmentMetadataQuery) 
inQ.getQuery()).withFinalizedAnalysisTypes(toolChest.getConfig());
+        final SegmentAnalyzer analyzer = new 
SegmentAnalyzer(updatedQuery.getAnalysisTypes());
+
+        Integer numRows = segmentRef.getDataSegment().getTotalRows();
+        if (numRows == null) {
+          numRows = analyzer.numRows(segment);
+        }
+        long totalSize = analyzer.analyzingSize() ? 
segmentRef.getDataSegment().getSize() : 0L;

Review Comment:
   this is is a breaking change to size analysis. While the current size 
analysis is arguably nonsense, I don't think we want to do an incompatible 
change like this, see discussion in 
https://github.com/apache/druid/issues/7124, and so this change should be 
reverted.
   
   I'm also not terribly sure the total size isn't particularly useful, I think 
the size analysis we want to replace this with would be a breakdown of column 
parts since that is actionable (e.g. can decide to not have indexes or whatever 
else). In other words, I don't think `DataSegment` is particularly useful to 
use here since we already have the `Segment`, so it doesn't save us anything, 
and costs us dragging around the overhead of an extra memory reference linking 
back to the `DataSegment` for anything a query engine processes



##########
processing/src/main/java/org/apache/druid/segment/SegmentReference.java:
##########
@@ -49,23 +51,51 @@ public static SegmentReference missing(SegmentDescriptor 
segmentDescriptor)
     return new SegmentReference(segmentDescriptor, Optional.empty(), null);
   }
 
+  @Nullable
+  private final DataSegment dataSegment;
   private final SegmentDescriptor segmentDescriptor;
   private final Optional<Segment> segmentReference;
   @Nullable
   private final Closeable cleanupHold;
   private final AtomicBoolean closed = new AtomicBoolean(false);
 
+  public SegmentReference(
+      DataSegmentAndDescriptor segment,
+      Optional<Segment> segmentReference,
+      @Nullable Closeable cleanupHold
+  )
+  {
+    this(segment.getDataSegment(), segment.getDescriptor(), segmentReference, 
cleanupHold);
+  }
+
   public SegmentReference(
       SegmentDescriptor segmentDescriptor,
       Optional<Segment> segmentReference,
       @Nullable Closeable cleanupHold
   )
   {
+    this(null, segmentDescriptor, segmentReference, cleanupHold);
+  }
+
+  private SegmentReference(
+      @Nullable DataSegment segment,

Review Comment:
   this change feels controversial to me, I don't love the idea of dragging 
around a reference to `DataSegment` everywhere we are using segments, what is 
this for. just for the segment metadata query changes?



##########
server/src/main/java/org/apache/druid/segment/realtime/appenderator/StreamAppenderator.java:
##########
@@ -970,6 +971,14 @@ private DataSegmentWithMetadata mergeAndPush(
         metrics.setMergeTime(mergeTimeMillis);
 
         log.debug("Segment[%s] built in %,dms.", identifier, mergeTimeMillis);
+        QueryableIndex index = indexIO.loadIndex(mergedFile);
+        closer.register(index);

Review Comment:
   same comment about `sink.getNumRows()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add numRows to DataSegment (druid)

Reply via email to