Re: [PR] Preemptive restriction for queries with approximate count distinct on complex columns of unsupported type (druid)

via GitHub Wed, 17 Jul 2024 01:51:54 -0700


kgyrtkirk commented on code in PR #16682:
URL: https://github.com/apache/druid/pull/16682#discussion_r1680681396



##########
sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/BuiltinApproxCountDistinctSqlAggregator.java:
##########
@@ -151,12 +165,22 @@ private static class 
BuiltinApproxCountDistinctSqlAggFunction extends SqlAggFunc
           SqlKind.OTHER_FUNCTION,
           ReturnTypes.explicit(SqlTypeName.BIGINT),
           InferTypes.VARCHAR_1024,
-          OperandTypes.ANY,
+          OperandTypes.or(
+              OperandTypes.STRING,
+              OperandTypes.NUMERIC,
+              
RowSignatures.complexTypeChecker(HyperUniquesAggregatorFactory.TYPE)
+          ),
           SqlFunctionCategory.STRING,
           false,
           false,
           Optionality.FORBIDDEN
       );
     }
   }
+
+  private boolean validateInputType(ColumnType columnType)

Review Comment:
   I think this method name is a bit unfortunate; 
   
   the method name suggests that it will deiced if an input type is valid; 
however it only accepts some complex types - I would expect a `true` from this 
method for simple types like `LONG` .
   
   it should be either renamed; or reorganized



##########
sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/BuiltinApproxCountDistinctSqlAggregator.java:
##########
@@ -118,12 +120,14 @@ public Aggregation toDruidAggregation(
       }
 
       if (inputType.is(ValueType.COMPLEX)) {
-        aggregatorFactory = new HyperUniquesAggregatorFactory(
-            aggregatorName,
-            dimensionSpec.getOutputName(),
-            false,
-            true
-        );
+        if (validateInputType(inputType)) {

Review Comment:
   
   looking at the variables: `inputAccessor`, `arg`, `rexNode`, `dataType` - 
they are "connected":
   * the 1st if uses a natively converted expressions directcolumn property to 
look it back up in the inputaccessor
   * the else branch was going the other way and was asking for the type of the 
input argument
   * if the 1st argument was converted to an expression which directly read an 
input column; then that means the `rexNode`-s type must have been the same as 
`dataType` unless it couldn't have been a direct column access
   
   
![gif](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExY3YyN2VlaTIyYWhtbG9vcXM0Z2Q2bmJhNnpoaWpzZ2tuaGV1MzZ3byZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/39xDerRCIoX2WeUVBz/giphy.gif)
   
   so...
   ```
   inputAccessor.getInputRowSignature()
               .getColumnType(arg.getDirectColumn())
   ```
   is essentially the same but is only accurate if its a `DirectColumn` ; so 
   you could move this check outside alongside with:
   ```
         final RelDataType dataType = rexNode.getType();
   final ColumnType inputType = Calcites.getColumnTypeForRelDataType(dataType);
   ```
   and make it only there...
   
   



##########
extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/theta/sql/ThetaSketchBaseSqlAggregator.java:
##########
@@ -116,26 +121,38 @@ public Aggregation toDruidAggregation(
         );
       }
 
-      final DimensionSpec dimensionSpec;
-
-      if (columnArg.isDirectColumnAccess()) {
-        dimensionSpec = columnArg.getSimpleExtraction().toDimensionSpec(null, 
inputType);
-      } else {
-        String virtualColumnName = 
virtualColumnRegistry.getOrCreateVirtualColumnForExpression(
-            columnArg,
-            dataType
+      if (!inputType.is(ValueType.COMPLEX)) {

Review Comment:
   I think yes ; you are trying to infer backward from `aggregator == null` 
that the issue was that the aggregation is not supported with this type - 
although that might be true right now ; that could change in the future - if 
for some reason in some other cases the `aggregate` will become `null`; but in 
that case the check will still hit - and produce a misleading error...
   
   note: you don't need an `else` if there is a `return`
   I don't think its beneficial to keep running till the function end if its 
known that the operation contract was already violated
   ```
   if (inputType.is(ValueType.COMPLEX)) {
     plannerContext.setPlanningError(xyz);
     return null; 
   }
    // code
   ```



##########
sql/src/main/java/org/apache/druid/sql/calcite/aggregation/builtin/BuiltinApproxCountDistinctSqlAggregator.java:
##########
@@ -177,4 +177,10 @@ private static class 
BuiltinApproxCountDistinctSqlAggFunction extends SqlAggFunc
       );
     }
   }
+
+  private boolean validateInputComplexTypeName(ColumnType columnType)
+  {
+    return Objects.equals(columnType.getComplexTypeName(), 
HyperUniquesAggregatorFactory.TYPE.getComplexTypeName()) ||
+           Objects.equals(columnType.getComplexTypeName(), 
HyperUniquesAggregatorFactory.PRECOMPUTED_TYPE.getComplexTypeName());

Review Comment:
   I'm a big fan of only allowing what's absolutely necessary.



##########
extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/hll/sql/HllSketchApproxCountDistinctSqlAggregator.java:
##########
@@ -21,28 +21,68 @@
 
 import org.apache.calcite.sql.SqlAggFunction;
 import org.apache.calcite.sql.SqlFunctionCategory;
+import org.apache.calcite.sql.type.CastedLiteralOperandTypeCheckers;
 import org.apache.calcite.sql.type.InferTypes;
+import org.apache.calcite.sql.type.OperandTypes;
+import org.apache.calcite.sql.type.SqlSingleOperandTypeChecker;
 import org.apache.calcite.sql.type.SqlTypeFamily;
 import org.apache.calcite.sql.type.SqlTypeName;
 import org.apache.druid.java.util.common.StringEncoding;
+import org.apache.druid.java.util.common.StringUtils;
 import org.apache.druid.query.aggregation.AggregatorFactory;
+import 
org.apache.druid.query.aggregation.datasketches.hll.HllSketchBuildAggregatorFactory;
+import 
org.apache.druid.query.aggregation.datasketches.hll.HllSketchMergeAggregatorFactory;
 import 
org.apache.druid.query.aggregation.post.FinalizingFieldAccessPostAggregator;
 import org.apache.druid.sql.calcite.aggregation.Aggregation;
 import org.apache.druid.sql.calcite.aggregation.SqlAggregator;
 import org.apache.druid.sql.calcite.expression.OperatorConversions;
+import org.apache.druid.sql.calcite.table.RowSignatures;
 
 import java.util.Collections;
 
+/**
+ * Approximate count distinct aggregator using HLL sketches.
+ * Supported column types: String, Numeric, HLLSketchMerge, HLLSketchBuild.
+ */
 public class HllSketchApproxCountDistinctSqlAggregator extends 
HllSketchBaseSqlAggregator implements SqlAggregator
 {
   public static final String NAME = "APPROX_COUNT_DISTINCT_DS_HLL";
+
+  private static final SqlSingleOperandTypeChecker 
AGGREGATED_COLUMN_TYPE_CHECKER = OperandTypes.or(
+      OperandTypes.STRING,
+      OperandTypes.NUMERIC,
+      RowSignatures.complexTypeChecker(HllSketchMergeAggregatorFactory.TYPE),
+      RowSignatures.complexTypeChecker(HllSketchBuildAggregatorFactory.TYPE)
+  );
+
   private static final SqlAggFunction FUNCTION_INSTANCE =
       OperatorConversions.aggregatorBuilder(NAME)
-                         .operandNames("column", "lgK", "tgtHllType")
-                         .operandTypes(SqlTypeFamily.ANY, 
SqlTypeFamily.NUMERIC, SqlTypeFamily.STRING)
                          .operandTypeInference(InferTypes.VARCHAR_1024)
-                         .requiredOperandCount(1)
-                         .literalOperands(1, 2)
+                         .operandTypeChecker(
+                             OperandTypes.or(
+                                 // APPROX_COUNT_DISTINCT_DS_HLL(column)
+                                 
OperandTypes.and(AGGREGATED_COLUMN_TYPE_CHECKER, 
OperandTypes.family(SqlTypeFamily.ANY)),

Review Comment:
   yes - we could someday rewise those as well...there are many branch of this 
`or` because the functions mandates only the 1st argument like:
   ```
   fn(column, [lgk, [tgtHllType]])
   ```
   
   but in my last comment I just meaned that:
   ```
   OperandTypes.and(AGGREGATED_COLUMN_TYPE_CHECKER, 
OperandTypes.family(SqlTypeFamily.ANY))
   ```
   could possibly be
   ```
   AGGREGATED_COLUMN_TYPE_CHECKER
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Preemptive restriction for queries with approximate count distinct on complex columns of unsupported type (druid)

Reply via email to