clintropolis commented on a change in pull request #12073:
URL: https://github.com/apache/druid/pull/12073#discussion_r791614256



##########
File path: processing/src/main/java/org/apache/druid/query/aggregation/AggregatorFactory.java
##########
@@ -69,6 +69,24 @@ public VectorAggregator factorizeVector(VectorColumnSelectorFactory selectorFact
     throw new UOE("Aggregator[%s] cannot vectorize", getClass().getName());
   }
 
+  /**
+   * Creates an {@link Aggregator} based on the provided column selector factory.
+   * The returned value is a holder object which contains both the aggregator
+   * and its initial size in bytes. The callers can then invoke
+   * {@link Aggregator#aggregateWithSize()} to perform aggregation and get back
+   * the incremental memory required in each aggregate call. Combined with the
+   * initial size, this gives the total on-heap memory required by the aggregator.
+   *
+   * This flow does not require invoking {@link #guessAggregatorHeapFootprint(long)}
+   * which tends to over-estimate the required memory.
+   *
+   * @return AggregatorAndSize which contains the actual aggregator and its initial size.
+   */
+  public AggregatorAndSize factorizeWithSize(ColumnSelectorFactory metricFactory)

Review comment:
       Same question about the contract for the returned sizes. Also, I wonder if 
there is anything we could do to make sure this method is overridden whenever 
`aggregateWithSize` is implemented, so that the initial size is not the max 
size...
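A hedged sketch of one way to enforce the override, borrowing the pattern from `factorizeVector` above, which throws unless the subclass supports the feature. All names except `factorizeWithSize` are simplified stand-ins for illustration, not Druid's actual classes:

```java
// Sketch only: stand-ins for Druid's AggregatorFactory / AggregatorAndSize.
class AggregatorAndSize {
  final Object aggregator;      // the actual aggregator
  final long initialSizeBytes;  // its initial on-heap footprint in bytes

  AggregatorAndSize(Object aggregator, long initialSizeBytes) {
    this.aggregator = aggregator;
    this.initialSizeBytes = initialSizeBytes;
  }
}

class AggregatorFactorySketch {
  // Throwing by default (like factorizeVector above) would force any factory
  // that implements aggregateWithSize() to also provide a real initial size,
  // instead of silently falling back to a max-size estimate.
  public AggregatorAndSize factorizeWithSize() {
    throw new UnsupportedOperationException(
        getClass().getName() + " does not support factorizeWithSize");
  }
}
```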

##########
File path: processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java
##########
@@ -60,20 +60,29 @@ private static String emptyToNullIfNeeded(@Nullable Object o)
   private final MultiValueHandling multiValueHandling;
   private final boolean hasBitmapIndexes;
   private final boolean hasSpatialIndexes;
+  private final boolean useMaxMemoryEstimates;
   private volatile boolean hasMultipleValues = false;
 
-  public StringDimensionIndexer(MultiValueHandling multiValueHandling, boolean hasBitmapIndexes, boolean hasSpatialIndexes)
+  public StringDimensionIndexer(
+      MultiValueHandling multiValueHandling,
+      boolean hasBitmapIndexes,
+      boolean hasSpatialIndexes,
+      boolean useMaxMemoryEstimates
+  )
   {
+    super(useMaxMemoryEstimates ? new DimensionDictionary<>() : new StringDimensionDictionary());

Review comment:
       nit: this seems strange/confusing. Why wouldn't `StringDimensionIndexer` 
always use `StringDimensionDictionary` here? It seems like we could just pass in 
a value of `false` to control `computeOnHeapSize` instead of sometimes not using 
`StringDimensionDictionary`.
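A minimal sketch of the suggestion, assuming a `computeOnHeapSize` constructor flag. The flag name and the per-entry sizing arithmetic are illustrative stand-ins, not Druid's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a string dictionary that always exists, with on-heap size
// tracking toggled by a flag rather than by swapping in a different class.
class StringDictionarySketch {
  private final boolean computeOnHeapSize;
  private final Map<String, Integer> idLookup = new HashMap<>();
  private long sizeInBytes = 0;

  StringDictionarySketch(boolean computeOnHeapSize) {
    this.computeOnHeapSize = computeOnHeapSize;
  }

  int add(String value) {
    Integer existing = idLookup.get(value);
    if (existing != null) {
      return existing; // already interned, no size change
    }
    int id = idLookup.size();
    idLookup.put(value, id);
    if (computeOnHeapSize) {
      // crude per-entry estimate: 2 bytes per char plus fixed object overhead
      sizeInBytes += 2L * value.length() + 48;
    }
    return id;
  }

  long sizeInBytes() {
    return sizeInBytes;
  }
}
```

With this shape, the indexer could always construct the string dictionary and simply pass `false` when size tracking is not wanted.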

##########
File path: processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java
##########
@@ -60,20 +60,29 @@ private static String emptyToNullIfNeeded(@Nullable Object o)
   private final MultiValueHandling multiValueHandling;
   private final boolean hasBitmapIndexes;
   private final boolean hasSpatialIndexes;
+  private final boolean useMaxMemoryEstimates;
   private volatile boolean hasMultipleValues = false;
 
-  public StringDimensionIndexer(MultiValueHandling multiValueHandling, boolean hasBitmapIndexes, boolean hasSpatialIndexes)
+  public StringDimensionIndexer(
+      MultiValueHandling multiValueHandling,
+      boolean hasBitmapIndexes,
+      boolean hasSpatialIndexes,
+      boolean useMaxMemoryEstimates
+  )
   {
+    super(useMaxMemoryEstimates ? new DimensionDictionary<>() : new StringDimensionDictionary());
     this.multiValueHandling = multiValueHandling == null ? MultiValueHandling.ofDefault() : multiValueHandling;
     this.hasBitmapIndexes = hasBitmapIndexes;
     this.hasSpatialIndexes = hasSpatialIndexes;
+    this.useMaxMemoryEstimates = useMaxMemoryEstimates;
   }
 
   @Override
-  public int[] processRowValsToUnsortedEncodedKeyComponent(@Nullable Object dimValues, boolean reportParseExceptions)
+  public EncodedKeyComponent<int[]> processRowValsToUnsortedEncodedKeyComponent(@Nullable Object dimValues, boolean reportParseExceptions)
   {
     final int[] encodedDimensionValues;
     final int oldDictSize = dimLookup.size();
+    final long oldDictSizeInBytes = useMaxMemoryEstimates ? 0 : dimLookup.sizeInBytes();

Review comment:
       nit: this is only used inside the `else` near the end of the method
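A sketch of the nit under illustrative names (the real method's dictionary and return type differ): take the size snapshot only on the branch that actually uses it.

```java
import java.util.HashMap;
import java.util.Map;

class EncodeSketch {
  private final Map<String, Integer> dimLookup = new HashMap<>();

  // crude stand-in for dimLookup.sizeInBytes()
  private long sizeInBytes() {
    long bytes = 0;
    for (String key : dimLookup.keySet()) {
      bytes += 2L * key.length() + 48; // rough per-entry estimate
    }
    return bytes;
  }

  // returns the estimated incremental on-heap size of encoding `value`
  long encode(String value, boolean useMaxMemoryEstimates) {
    if (useMaxMemoryEstimates) {
      dimLookup.putIfAbsent(value, dimLookup.size());
      return 0; // the max-estimate path never inspects byte sizes
    } else {
      // snapshot taken here, per the nit: only this branch needs it
      final long oldDictSizeInBytes = sizeInBytes();
      dimLookup.putIfAbsent(value, dimLookup.size());
      return sizeInBytes() - oldDictSizeInBytes;
    }
  }
}
```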

##########
File path: extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/theta/SketchAggregator.java
##########
@@ -21,23 +21,43 @@
 
 import org.apache.datasketches.Family;
 import org.apache.datasketches.theta.SetOperation;
+import org.apache.datasketches.theta.Sketch;
 import org.apache.datasketches.theta.Union;
 import org.apache.druid.common.config.NullHandling;
 import org.apache.druid.java.util.common.ISE;
 import org.apache.druid.query.aggregation.Aggregator;
 import org.apache.druid.segment.BaseObjectColumnValueSelector;
 
 import javax.annotation.Nullable;
+import java.lang.reflect.Field;
 import java.util.List;
 
 public class SketchAggregator implements Aggregator
 {
+
   private final BaseObjectColumnValueSelector selector;
   private final int size;
 
   @Nullable
   private Union union;
 
+  @Nullable
+  private Sketch sketch;
+
+  @Nullable
+  private static final Field SKETCH_FIELD;
+
+  static {
+    try {
+      SKETCH_FIELD = Class.forName("org.apache.datasketches.theta.UnionImpl")
+                          .getDeclaredField("gadget_");
+      SKETCH_FIELD.setAccessible(true);
+    }
+    catch (NoSuchFieldException | ClassNotFoundException e) {
+      throw new ISE(e, "Could not initialize SketchAggregator");
+    }

Review comment:
       this seems worth a comment, and maybe a link to the code?
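A sketch of the kind of comment the static block could carry, using a local stand-in class so the snippet is self-contained (the real code reflects into datasketches' `org.apache.datasketches.theta.UnionImpl`, as shown in the diff; the comment text is illustrative only):

```java
import java.lang.reflect.Field;

class ReflectionCommentSketch {
  // local stand-in for datasketches' UnionImpl, whose backing
  // sketch field is private with no public accessor
  static class UnionImpl {
    private final String gadget_ = "backing-sketch";
  }

  static final Field SKETCH_FIELD;

  static {
    try {
      // Why reflection: UnionImpl keeps its backing sketch ("gadget_")
      // private and exposes no accessor, so reading it reflectively is the
      // only way to observe it for sizing. A link to the datasketches source
      // for this field would go here, per the review comment.
      SKETCH_FIELD = UnionImpl.class.getDeclaredField("gadget_");
      SKETCH_FIELD.setAccessible(true);
    }
    catch (NoSuchFieldException e) {
      throw new IllegalStateException("Could not initialize SketchAggregator", e);
    }
  }
}
```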

##########
File path: processing/src/main/java/org/apache/druid/query/aggregation/Aggregator.java
##########
@@ -42,6 +42,12 @@
 {
   void aggregate();
 
+  default long aggregateWithSize()

Review comment:
       What is the contract of this value? I don't see the word estimate in 
here, but think it probably should be... should implementors over-estimate if 
exact sizing is not possible, or is under-estimating fine? Should there be a 
warning when the default estimate is used? (I imagine this would be very noisy 
if done per aggregate call... so I don't really recommend doing it here or 
anything...)
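A sketch of the documented contract the comment is asking for; `maxRequiredBytes()` is an illustrative helper, not part of Druid's `Aggregator` interface, and the over-estimation rule is the reviewer's suggested contract, not the PR's stated one.

```java
// Sketch only: an interface whose javadoc spells out the estimate contract.
interface AggregatorSketch {
  void aggregate();

  /**
   * Aggregates and returns an ESTIMATE of the extra on-heap bytes consumed by
   * this call. Per the suggested contract, implementors that cannot size
   * exactly should over-estimate; the default conservatively reports the
   * maximum possible growth so memory accounting never under-counts.
   */
  default long aggregateWithSize() {
    aggregate();
    return maxRequiredBytes();
  }

  // worst-case growth per call, used by the default implementation above
  long maxRequiredBytes();
}
```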




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


