clintropolis commented on a change in pull request #12073:
URL: https://github.com/apache/druid/pull/12073#discussion_r791614256
##########
File path:
processing/src/main/java/org/apache/druid/query/aggregation/AggregatorFactory.java
##########
@@ -69,6 +69,24 @@ public VectorAggregator
factorizeVector(VectorColumnSelectorFactory selectorFact
throw new UOE("Aggregator[%s] cannot vectorize", getClass().getName());
}
+ /**
+ * Creates an {@link Aggregator} based on the provided column selector
factory.
+ * The returned value is a holder object which contains both the aggregator
+ * and its initial size in bytes. The callers can then invoke
+ * {@link Aggregator#aggregateWithSize()} to perform aggregation and get back
+ * the incremental memory required in each aggregate call. Combined with the
+ * initial size, this gives the total on-heap memory required by the
aggregator.
+ *
+ * This flow does not require invoking {@link
#guessAggregatorHeapFootprint(long)}
+ * which tends to over-estimate the required memory.
+ *
+ * @return AggregatorAndSize which contains the actual aggregator and its
initial size.
+ */
+ public AggregatorAndSize factorizeWithSize(ColumnSelectorFactory
metricFactory)
Review comment:
same question about contract about returned sizes. Also, I wonder if
there is anything we could do to make sure this method is overridden if
`aggregateWithSize` is implemented, so that the initial size is not the max
size...
##########
File path:
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java
##########
@@ -60,20 +60,29 @@ private static String emptyToNullIfNeeded(@Nullable Object
o)
private final MultiValueHandling multiValueHandling;
private final boolean hasBitmapIndexes;
private final boolean hasSpatialIndexes;
+ private final boolean useMaxMemoryEstimates;
private volatile boolean hasMultipleValues = false;
- public StringDimensionIndexer(MultiValueHandling multiValueHandling, boolean
hasBitmapIndexes, boolean hasSpatialIndexes)
+ public StringDimensionIndexer(
+ MultiValueHandling multiValueHandling,
+ boolean hasBitmapIndexes,
+ boolean hasSpatialIndexes,
+ boolean useMaxMemoryEstimates
+ )
{
+ super(useMaxMemoryEstimates ? new DimensionDictionary<>() : new
StringDimensionDictionary());
Review comment:
nit: this seems strange/confusing, why wouldn't `StringDimensionIndexer`
always use `StringDimensionDictionary` here? it seems like could just pass in a
value of `false` to control the value of `computeOnHeapSize` instead of
sometimes not using `StringDimensionDictionary`
##########
File path:
processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java
##########
@@ -60,20 +60,29 @@ private static String emptyToNullIfNeeded(@Nullable Object
o)
private final MultiValueHandling multiValueHandling;
private final boolean hasBitmapIndexes;
private final boolean hasSpatialIndexes;
+ private final boolean useMaxMemoryEstimates;
private volatile boolean hasMultipleValues = false;
- public StringDimensionIndexer(MultiValueHandling multiValueHandling, boolean
hasBitmapIndexes, boolean hasSpatialIndexes)
+ public StringDimensionIndexer(
+ MultiValueHandling multiValueHandling,
+ boolean hasBitmapIndexes,
+ boolean hasSpatialIndexes,
+ boolean useMaxMemoryEstimates
+ )
{
+ super(useMaxMemoryEstimates ? new DimensionDictionary<>() : new
StringDimensionDictionary());
this.multiValueHandling = multiValueHandling == null ?
MultiValueHandling.ofDefault() : multiValueHandling;
this.hasBitmapIndexes = hasBitmapIndexes;
this.hasSpatialIndexes = hasSpatialIndexes;
+ this.useMaxMemoryEstimates = useMaxMemoryEstimates;
}
@Override
- public int[] processRowValsToUnsortedEncodedKeyComponent(@Nullable Object
dimValues, boolean reportParseExceptions)
+ public EncodedKeyComponent<int[]>
processRowValsToUnsortedEncodedKeyComponent(@Nullable Object dimValues, boolean
reportParseExceptions)
{
final int[] encodedDimensionValues;
final int oldDictSize = dimLookup.size();
+ final long oldDictSizeInBytes = useMaxMemoryEstimates ? 0 :
dimLookup.sizeInBytes();
Review comment:
nit: this is only used inside of the else near the end of the method
##########
File path:
extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/theta/SketchAggregator.java
##########
@@ -21,23 +21,43 @@
import org.apache.datasketches.Family;
import org.apache.datasketches.theta.SetOperation;
+import org.apache.datasketches.theta.Sketch;
import org.apache.datasketches.theta.Union;
import org.apache.druid.common.config.NullHandling;
import org.apache.druid.java.util.common.ISE;
import org.apache.druid.query.aggregation.Aggregator;
import org.apache.druid.segment.BaseObjectColumnValueSelector;
import javax.annotation.Nullable;
+import java.lang.reflect.Field;
import java.util.List;
public class SketchAggregator implements Aggregator
{
+
private final BaseObjectColumnValueSelector selector;
private final int size;
@Nullable
private Union union;
+ @Nullable
+ private Sketch sketch;
+
+ @Nullable
+ private static final Field SKETCH_FIELD;
+
+ static {
+ try {
+ SKETCH_FIELD = Class.forName("org.apache.datasketches.theta.UnionImpl")
+ .getDeclaredField("gadget_");
+ SKETCH_FIELD.setAccessible(true);
+ }
+ catch (NoSuchFieldException | ClassNotFoundException e) {
+ throw new ISE(e, "Could not initialize SketchAggregator");
+ }
Review comment:
this seems worth a comment, and maybe a link to the code?
##########
File path:
processing/src/main/java/org/apache/druid/query/aggregation/Aggregator.java
##########
@@ -42,6 +42,12 @@
{
void aggregate();
+ default long aggregateWithSize()
Review comment:
what is the contract of this value, I don't see the word estimate in
here, but think it probably should be... should implementors over-estimate if
exact sizing is not possible or is under-estimating fine? Should there be a
warning that the default estimate is used? (i imagine this would be very noisy
if it is done per aggregate call... so don't really recommend doing it here or
anything...)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]