Re: [PR] Populate segment stats for non-parallel compaction jobs (druid)

via GitHub Thu, 28 Mar 2024 20:11:48 -0700


kfaraz commented on code in PR #16171:
URL: https://github.com/apache/druid/pull/16171#discussion_r1543990185



##########
indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java:
##########
@@ -620,4 +634,23 @@ public static List<TimelineObjectHolder<String, 
DataSegment>> getTimelineForSegm
 
     return new ArrayList<>(timeline.values());
   }
+
+  /**
+   * Should ideally be called after creating the reader with {@link 
#reader(InputRowSchema, InputFormat, File)}
+   * call to not be a costly  operation.
+   *
+   * @return count of segments handled by this input source
+   */
+  public int segmentsCount()

Review Comment:
   If the segment count has not been computed yet, this method should just 
return null instead of having a fallback.
   Ideally, the fallback should never be invoked (we can verify this in a test 
cluster).
   But if we keep a fallback, it would be difficult to know what happened 
inside this method.



##########
indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java:
##########
@@ -155,6 +157,8 @@ public class DruidInputSource extends AbstractInputSource 
implements SplittableI
 
   @Nullable
   private final TaskToolbox toolbox;
+  @Nullable
+  private Integer segmentsCount;

Review Comment:
   ```suggestion
     private Integer numSegmentsInTimeline;
   ```



##########
indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java:
##########
@@ -620,4 +634,23 @@ public static List<TimelineObjectHolder<String, 
DataSegment>> getTimelineForSegm
 
     return new ArrayList<>(timeline.values());
   }
+
+  /**
+   * Should ideally be called after creating the reader with {@link 
#reader(InputRowSchema, InputFormat, File)}
+   * call to not be a costly  operation.
+   *
+   * @return count of segments handled by this input source
+   */
+  public int segmentsCount()

Review Comment:
   ```suggestion
     public Integer getNumberOfSegmentsRead()
   ```



##########
indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java:
##########
@@ -620,4 +634,23 @@ public static List<TimelineObjectHolder<String, 
DataSegment>> getTimelineForSegm
 
     return new ArrayList<>(timeline.values());
   }
+
+  /**
+   * Should ideally be called after creating the reader with {@link 
#reader(InputRowSchema, InputFormat, File)}
+   * call to not be a costly  operation.
+   *
+   * @return count of segments handled by this input source
+   */
+  public int segmentsCount()
+  {
+    if (segmentIds != null) {
+      return segmentIds.size();
+    }
+
+    if (segmentsCount == null) {
+      createTimeline();
+    }
+
+    return segmentsCount;

Review Comment:
   I don't think we need to use the `segmentIds` in any case, we can just stick 
to using the count from the timeline.
   
   ```suggestion
       return segmentCount;
   ```



##########
indexing-service/src/test/java/org/apache/druid/indexing/common/task/IngestionTestBase.java:
##########
@@ -114,6 +119,7 @@ public abstract class IngestionTestBase extends 
InitializedNullHandlingTest
   private SegmentsMetadataManager segmentsMetadataManager;
   private TaskLockbox lockbox;
   private File baseDir;
+  protected File reportsFile;

Review Comment:
   Ah, never mind, just saw that it is being used in 
`AbstractParallelTaskTestAsWell`.
   
   However, could you make this field private and expose a `protected 
getReportsFile()` method? We don't want any test to reassign this variable 
(unless the test needs to write out multiple different reports, but that 
doesn't seem to be the case right now).



##########
indexing-service/src/main/java/org/apache/druid/indexing/input/DruidInputSource.java:
##########
@@ -620,4 +634,23 @@ public static List<TimelineObjectHolder<String, 
DataSegment>> getTimelineForSegm
 
     return new ArrayList<>(timeline.values());
   }
+
+  /**
+   * Should ideally be called after creating the reader with {@link 
#reader(InputRowSchema, InputFormat, File)}
+   * call to not be a costly  operation.
+   *
+   * @return count of segments handled by this input source

Review Comment:
   ```suggestion
      * @return Number of segments read by this input source. This value is 
null until the method {@link #fixedFormatReader} has been invoked on this input 
source.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org

Re: [PR] Populate segment stats for non-parallel compaction jobs (druid)

Reply via email to