[GitHub] [druid] paul-rogers commented on a diff in pull request #13887: Tuple sketch SQL support

via GitHub Fri, 17 Mar 2023 11:26:56 -0700


paul-rogers commented on code in PR #13887:
URL: https://github.com/apache/druid/pull/13887#discussion_r1140544380



##########
docs/querying/sql-aggregations.md:
##########
@@ -139,6 +139,16 @@ Load the [DataSketches 
extension](../development/extensions-core/datasketches-ex
 |`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles 
sketch](../development/extensions-core/datasketches-quantiles.md) on the values 
of `expr`, which can be a regular column or a column containing quantiles 
sketches. The `k` parameter is described in the Quantiles sketch 
documentation.<br/><br/>See the [known 
issue](sql-translation.md#approximations) with this function.|`'0'` (STRING)|
 
 
+### Tuple sketch functions
+
+Load the [DataSketches 
extension](../development/extensions-core/datasketches-extension.md) to use the 
following functions.

Review Comment:
   Can we include a link to a description of tuple sketches, and why I'd want 
one, for newbies that don't already understand this concept?



##########
docs/querying/sql-functions.md:
##########
@@ -136,6 +136,42 @@ Returns an array of all values of the specified expression.
 
 Concatenates array inputs into a single array.
 
+## ARRAY_OF_DOUBLES_SKETCH
+

Review Comment:
   As above: perhaps a description of the purpose of these functions for those 
not already in the know?



##########
docs/querying/sql-functions.md:
##########
@@ -136,6 +136,42 @@ Returns an array of all values of the specified expression.
 
 Concatenates array inputs into a single array.
 
+## ARRAY_OF_DOUBLES_SKETCH
+
+`ARRAY_OF_DOUBLES_SKETCH(expr, [nominalEntries])`
+
+`ARRAY_OF_DOUBLES_SKETCH(dimensionColumnExpr, metricColumnExpr, ..., 
[nominalEntries])`
+
+**Function type:** [Aggregation](sql-aggregations.md)
+
+Creates a Tuple sketch.
+
+## ARRAY_OF_DOUBLES_SKETCH_INTERSECT
+
+`ARRAY_OF_DOUBLES_SKETCH_INTERSECT(expr, ..., [nominalEntries])`
+
+**Function type:** [Scalar, sketch](sql-scalar.md#tuple-sketch-functions)
+
+## ARRAY_OF_DOUBLES_SKETCH_NOT
+
+`ARRAY_OF_DOUBLES_SKETCH_NOT(expr, ..., [nominalEntries])`
+
+**Function type:** [Scalar, sketch](sql-scalar.md#tuple-sketch-functions)
+
+## ARRAY_OF_DOUBLES_SKETCH_UNION
+
+`ARRAY_OF_DOUBLES_SKETCH_UNION(expr, ..., [nominalEntries])`
+
+**Function type:** [Scalar, sketch](sql-scalar.md#tuple-sketch-functions)
+
+## ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE

Review Comment:
   This is quite a long name for a poor user to remember and type.
   
   First, might there be an `ARRAY_OF_LONGS` or `ARRAY_OF_STRINGS` sketch in 
the future? If so, common SQL practice is to use function overrides: 
`ARRAY_SKETCH()`, say, which works out the "OF_DOUBLES", etc. part based on the 
type of the argument. See how `LATEST()` handles this idea.
   
   Second, suppose we'll only ever do an array of doubles. Then, a common 
approach is to introduce an acronym to shorten names: `AODS`. Or, to make it 
shorter: "double array sketch" or "DAS".
   
   Then, the suffix only has to be enough to differentiate each function: 
`DAS_NOT`, `DAS_UNION`, `DAS_SUM`, etc.



##########
extensions-core/datasketches/src/test/java/org/apache/druid/query/aggregation/datasketches/tuple/sql/ArrayOfDoublesSketchSqlAggregatorTest.java:
##########
@@ -0,0 +1,452 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.tuple.sql;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import com.google.inject.Injector;
+import org.apache.druid.data.input.InputRow;
+import org.apache.druid.guice.DruidInjectorBuilder;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.granularity.Granularities;
+import org.apache.druid.query.Druids;
+import org.apache.druid.query.QueryRunnerFactoryConglomerate;
+import org.apache.druid.query.aggregation.CountAggregatorFactory;
+import org.apache.druid.query.aggregation.LongSumAggregatorFactory;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchAggregatorFactory;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchModule;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchOperations;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchSetOpPostAggregator;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchToMetricsSumEstimatePostAggregator;
+import org.apache.druid.query.aggregation.post.ExpressionPostAggregator;
+import org.apache.druid.query.aggregation.post.FieldAccessPostAggregator;
+import org.apache.druid.query.dimension.DefaultDimensionSpec;
+import org.apache.druid.query.groupby.GroupByQuery;
+import org.apache.druid.segment.IndexBuilder;
+import org.apache.druid.segment.QueryableIndex;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.incremental.IncrementalIndexSchema;
+import org.apache.druid.segment.join.JoinableFactoryWrapper;
+import org.apache.druid.segment.virtual.ExpressionVirtualColumn;
+import 
org.apache.druid.segment.writeout.OffHeapMemorySegmentWriteOutMediumFactory;
+import org.apache.druid.sql.calcite.BaseCalciteQueryTest;
+import org.apache.druid.sql.calcite.filtration.Filtration;
+import org.apache.druid.sql.calcite.util.CalciteTests;
+import org.apache.druid.sql.calcite.util.SpecificSegmentsQuerySegmentWalker;
+import org.apache.druid.sql.calcite.util.TestDataBuilder;
+import org.apache.druid.timeline.DataSegment;
+import org.apache.druid.timeline.partition.LinearShardSpec;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class ArrayOfDoublesSketchSqlAggregatorTest extends BaseCalciteQueryTest
+{
+
+  private static final String DATA_SOURCE = "foo";
+
+  // built from ArrayOfDoublesUpdatableSketch.update("FEDCAB", new double[] 
{0.0}).compact()
+  private static final String COMPACT_BASE_64_ENCODED_SKETCH_FOR_INTERSECTION 
= "AQEJAwgBzJP/////////fwEAAAAAAAAAjFnadZuMrkgAAAAAAAAAAA==";
+
+  private static final List<InputRow> ROWS = ImmutableList.of(
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-01")
+                  .put("dim1", "CA")
+                  .put("dim2", "FEDCAB")
+                  .put("m1", 5)
+                  .build(),
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-01")
+                  .put("dim1", "US")
+                  .put("dim2", "ABCDEF")
+                  .put("m1", 12)
+                  .build(),
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-02")
+                  .put("dim1", "CA")
+                  .put("dim2", "FEDCAB")
+                  .put("m1", 3)
+                  .build(),
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-02")
+                  .put("dim1", "US")
+                  .put("dim2", "ABCDEF")
+                  .put("m1", 8)
+                  .build(),
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-03")
+                  .put("dim1", "US")
+                  .put("dim2", "ABCDEF")
+                  .put("m1", 2)
+                  .build()
+  ).stream().map(TestDataBuilder::createRow).collect(Collectors.toList());
+
+  @Override
+  public void configureGuice(DruidInjectorBuilder builder)
+  {
+    super.configureGuice(builder);
+    builder.addModule(new ArrayOfDoublesSketchModule());
+  }
+
+  @Override
+  public SpecificSegmentsQuerySegmentWalker createQuerySegmentWalker(
+      final QueryRunnerFactoryConglomerate conglomerate,
+      final JoinableFactoryWrapper joinableFactory,
+      final Injector injector
+  ) throws IOException
+  {
+    ArrayOfDoublesSketchModule.registerSerde();
+
+    final QueryableIndex index = IndexBuilder.create()
+                                             
.tmpDir(temporaryFolder.newFolder())
+                                             .segmentWriteOutMediumFactory(
+                                                 
OffHeapMemorySegmentWriteOutMediumFactory.instance()
+                                             )
+                                             .schema(
+                                                 new 
IncrementalIndexSchema.Builder()
+                                                     .withMetrics(
+                                                         new 
CountAggregatorFactory("cnt"),
+                                                         new 
ArrayOfDoublesSketchAggregatorFactory(
+                                                             
"tuplesketch_dim2",
+                                                             "dim2",
+                                                             null,
+                                                             
ImmutableList.of("m1"),
+                                                             1
+                                                         ),
+                                                         new 
LongSumAggregatorFactory("m1", "m1")
+                                                     )
+                                                     .withRollup(false)
+                                                     .build()
+                                             )
+                                             .rows(ROWS)
+                                             .buildMMappedIndex();
+
+    return new SpecificSegmentsQuerySegmentWalker(conglomerate).add(
+        DataSegment.builder()
+                   .dataSource(DATA_SOURCE)
+                   .interval(index.getDataInterval())
+                   .version("1")
+                   .shardSpec(new LinearShardSpec(0))
+                   .size(0)
+                   .build(),
+        index
+    );
+  }
+
+  @Test
+  public void testMetricsSumEstimate()
+  {
+    cannotVectorize();
+
+    final String sql = "SELECT\n"
+                   + "  dim1,\n"
+                   + "  SUM(cnt),\n"
+                   + "  
ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE(ARRAY_OF_DOUBLES_SKETCH(tuplesketch_dim2)),\n"
+                   + "  
ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE(ARRAY_OF_DOUBLES_SKETCH(dim2, 
m1)),\n"
+                   + "  
ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE(ARRAY_OF_DOUBLES_SKETCH(dim2, m1, 
256))\n"
+                   + "FROM druid.foo\n"
+                   + "GROUP BY dim1";
+
+    final List<Object[]> expectedResults;
+
+    expectedResults = ImmutableList.of(
+        new Object[]{
+            "CA",
+            2L,
+            "[8.0]",
+            "[8.0]",
+            "[8.0]"
+        },
+        new Object[]{
+            "US",
+            3L,
+            "[22.0]",
+            "[22.0]",
+            "[22.0]"
+        }
+    );
+
+    testQuery(
+        sql,
+        ImmutableList.of(
+            GroupByQuery.builder()
+                  .setDataSource(CalciteTests.DATASOURCE1)
+                  .setInterval(querySegmentSpec(Filtration.eternity()))
+                  .setGranularity(Granularities.ALL)
+                  .setDimensions(new DefaultDimensionSpec("dim1", "d0", 
ColumnType.STRING))
+                  .setAggregatorSpecs(
+                      aggregators(
+                          new LongSumAggregatorFactory("a0", "cnt"),
+                          new ArrayOfDoublesSketchAggregatorFactory(
+                                                             "a1",
+                                                             
"tuplesketch_dim2",
+                                                             null,
+                                                             null,
+                                                             null
+                          ),
+                          new ArrayOfDoublesSketchAggregatorFactory(
+                                                             "a2",
+                                                             "dim2",
+                                                             null,
+                                                             
ImmutableList.of("m1"),
+                                                             null
+                          ),
+                          new ArrayOfDoublesSketchAggregatorFactory(
+                                                             "a3",
+                                                             "dim2",
+                                                             256,
+                                                             
ImmutableList.of("m1"),
+                                                             null
+                          )
+                      )
+                  )
+                  .setPostAggregatorSpecs(
+                      ImmutableList.of(
+                         new 
ArrayOfDoublesSketchToMetricsSumEstimatePostAggregator(
+                           "p1",
+                           new FieldAccessPostAggregator("p0", "a1")
+                         ),
+                         new 
ArrayOfDoublesSketchToMetricsSumEstimatePostAggregator(
+                           "p3",
+                           new FieldAccessPostAggregator("p2", "a2")
+                         ),
+                         new 
ArrayOfDoublesSketchToMetricsSumEstimatePostAggregator(
+                           "p5",
+                           new FieldAccessPostAggregator("p4", "a3")
+                         )
+                      )
+                  )
+                  .setContext(QUERY_CONTEXT_DEFAULT)
+                  .build()
+        ),
+        expectedResults
+    );
+  }
+
+  @Test
+  public void testMetricsSumEstimateIntersect()
+  {
+    cannotVectorize();
+
+    final String sql = "SELECT\n"
+                   + "  SUM(cnt),\n"
+                   + "  
ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE(ARRAY_OF_DOUBLES_SKETCH(tuplesketch_dim2))
 AS all_sum_estimates,\n"
+                   + StringUtils.replace(
+                      
"ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE(ARRAY_OF_DOUBLES_SKETCH_INTERSECT(COMPLEX_DECODE_BASE64('arrayOfDoublesSketch',
 '%s'), ARRAY_OF_DOUBLES_SKETCH(tuplesketch_dim2), 128)) AS 
intersect_sum_estimates\n",
+                      "%s",
+                      COMPACT_BASE_64_ENCODED_SKETCH_FOR_INTERSECTION
+                   )
+                   + "FROM druid.foo";
+
+    final List<Object[]> expectedResults;
+
+    expectedResults = ImmutableList.of(
+        new Object[]{
+            5L,
+            "[30.0]",
+            "[8.0]"
+        }
+    );
+
+    final String expectedBase64Constant = "'"
+                                 + 
StringUtils.replace(COMPACT_BASE_64_ENCODED_SKETCH_FOR_INTERSECTION, "=", 
"\\u003D")
+                                 + "'";
+
+    testQuery(
+        sql,
+        ImmutableList.of(
+            Druids.newTimeseriesQueryBuilder()
+                  .dataSource(CalciteTests.DATASOURCE1)
+                  .intervals(querySegmentSpec(Filtration.eternity()))
+                  .granularity(Granularities.ALL)
+                  .aggregators(
+                      ImmutableList.of(
+                          new LongSumAggregatorFactory("a0", "cnt"),
+                          new ArrayOfDoublesSketchAggregatorFactory(
+                                                             "a1",
+                                                             
"tuplesketch_dim2",
+                                                             null,
+                                                             null,
+                                                             null
+                          )
+                      )
+                  )
+                  .postAggregators(
+                      ImmutableList.of(
+                         new 
ArrayOfDoublesSketchToMetricsSumEstimatePostAggregator(
+                           "p1",
+                           new FieldAccessPostAggregator("p0", "a1")
+                         ),
+                         new 
ArrayOfDoublesSketchToMetricsSumEstimatePostAggregator(
+                           "p5",
+                           new ArrayOfDoublesSketchSetOpPostAggregator(
+                             "p4",
+                             "INTERSECT",
+                             128,
+                             null,
+                             ImmutableList.of(
+                               new ExpressionPostAggregator(
+                                 "p2",
+                                 
"complex_decode_base64('arrayOfDoublesSketch'," + expectedBase64Constant + ")",
+                                 null,
+                                 queryFramework().macroTable()
+                               ),
+                               new FieldAccessPostAggregator("p3", "a1")
+                             )
+                           )
+                         )
+                      )
+                  )
+                  .context(QUERY_CONTEXT_DEFAULT)
+                  .build()
+        ),
+        expectedResults
+    );
+  }
+
+  @Test
+  public void testNullInputs()
+  {
+    cannotVectorize();
+
+    final String sql = "SELECT\n"

Review Comment:
   None of these test queries use aggregations (`GROUP BY` clause). Would these 
functions work in that case? Should we include tests to demonstrate how they'd 
work?



##########
docs/querying/sql-functions.md:
##########
@@ -136,6 +136,42 @@ Returns an array of all values of the specified expression.
 
 Concatenates array inputs into a single array.
 
+## ARRAY_OF_DOUBLES_SKETCH
+
+`ARRAY_OF_DOUBLES_SKETCH(expr, [nominalEntries])`
+
+`ARRAY_OF_DOUBLES_SKETCH(dimensionColumnExpr, metricColumnExpr, ..., 
[nominalEntries])`
+
+**Function type:** [Aggregation](sql-aggregations.md)
+
+Creates a Tuple sketch.
+
+## ARRAY_OF_DOUBLES_SKETCH_INTERSECT
+
+`ARRAY_OF_DOUBLES_SKETCH_INTERSECT(expr, ..., [nominalEntries])`
+
+**Function type:** [Scalar, sketch](sql-scalar.md#tuple-sketch-functions)
+
+## ARRAY_OF_DOUBLES_SKETCH_NOT
+
+`ARRAY_OF_DOUBLES_SKETCH_NOT(expr, ..., [nominalEntries])`
+
+**Function type:** [Scalar, sketch](sql-scalar.md#tuple-sketch-functions)
+
+## ARRAY_OF_DOUBLES_SKETCH_UNION
+
+`ARRAY_OF_DOUBLES_SKETCH_UNION(expr, ..., [nominalEntries])`
+
+**Function type:** [Scalar, sketch](sql-scalar.md#tuple-sketch-functions)
+
+## ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE

Review Comment:
   The set of functions here seem to mimic the new scheme that various Druid 
principals have outline: functions to create the intermediate types, and more 
functions to pull final values out of those intermediate types.
   
   That approach makes sense for a native query, where the query spells out the 
details of aggregation.
   
   For SQL, however, this pattern is not a good fit. SQL provides a single 
function that goes from inputs to finalized types. For example, for `LATEST()` 
at the native layer, we may eventually want:
   
   * `LATEST_LONG_SKETCH(string, long) -> COMPLEX<PAIR<string, long>>`
   * `LATEST_UNION(COMPLEX<PAIR<string, long>> COMPLEX<PAIR<string, long>>) -> 
COMPLEX<PAIR<string, long>>`
   * `LATEST_LEFT( COMPLEX<PAIR<string, long>>) -> string`
   * `LATEST_RIGHT(COMPLEX<PAIR<string, long>> -> long`
   
   However, SQL doesn't provide the semantics to use such functions: one cannot 
get "inside" of a `GROUP BY` to create the intermediate values, combine them, 
and then go "outside" of the `GROUP BY` to produce the finalized value.
   
   Instead, in SQL, there is one function, `LATEST(.)` which generates the 
internal forms when creating the native query.
   
   All of this is a long-winded way of asking, for SQL, do we just need an 
`APPROX_SUM(ARRAY<DOUBLE>)` function? The rest is worked out by the planner to 
handle doing approx sums for arrays of doubles?



##########
extensions-core/datasketches/src/test/java/org/apache/druid/query/aggregation/datasketches/tuple/sql/ArrayOfDoublesSketchSqlAggregatorTest.java:
##########
@@ -0,0 +1,452 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.tuple.sql;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableMap;
+import com.google.inject.Injector;
+import org.apache.druid.data.input.InputRow;
+import org.apache.druid.guice.DruidInjectorBuilder;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.java.util.common.granularity.Granularities;
+import org.apache.druid.query.Druids;
+import org.apache.druid.query.QueryRunnerFactoryConglomerate;
+import org.apache.druid.query.aggregation.CountAggregatorFactory;
+import org.apache.druid.query.aggregation.LongSumAggregatorFactory;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchAggregatorFactory;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchModule;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchOperations;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchSetOpPostAggregator;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchToMetricsSumEstimatePostAggregator;
+import org.apache.druid.query.aggregation.post.ExpressionPostAggregator;
+import org.apache.druid.query.aggregation.post.FieldAccessPostAggregator;
+import org.apache.druid.query.dimension.DefaultDimensionSpec;
+import org.apache.druid.query.groupby.GroupByQuery;
+import org.apache.druid.segment.IndexBuilder;
+import org.apache.druid.segment.QueryableIndex;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.incremental.IncrementalIndexSchema;
+import org.apache.druid.segment.join.JoinableFactoryWrapper;
+import org.apache.druid.segment.virtual.ExpressionVirtualColumn;
+import 
org.apache.druid.segment.writeout.OffHeapMemorySegmentWriteOutMediumFactory;
+import org.apache.druid.sql.calcite.BaseCalciteQueryTest;
+import org.apache.druid.sql.calcite.filtration.Filtration;
+import org.apache.druid.sql.calcite.util.CalciteTests;
+import org.apache.druid.sql.calcite.util.SpecificSegmentsQuerySegmentWalker;
+import org.apache.druid.sql.calcite.util.TestDataBuilder;
+import org.apache.druid.timeline.DataSegment;
+import org.apache.druid.timeline.partition.LinearShardSpec;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class ArrayOfDoublesSketchSqlAggregatorTest extends BaseCalciteQueryTest
+{
+
+  private static final String DATA_SOURCE = "foo";
+
+  // built from ArrayOfDoublesUpdatableSketch.update("FEDCAB", new double[] 
{0.0}).compact()
+  private static final String COMPACT_BASE_64_ENCODED_SKETCH_FOR_INTERSECTION 
= "AQEJAwgBzJP/////////fwEAAAAAAAAAjFnadZuMrkgAAAAAAAAAAA==";
+
+  private static final List<InputRow> ROWS = ImmutableList.of(
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-01")
+                  .put("dim1", "CA")
+                  .put("dim2", "FEDCAB")
+                  .put("m1", 5)
+                  .build(),
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-01")
+                  .put("dim1", "US")
+                  .put("dim2", "ABCDEF")
+                  .put("m1", 12)
+                  .build(),
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-02")
+                  .put("dim1", "CA")
+                  .put("dim2", "FEDCAB")
+                  .put("m1", 3)
+                  .build(),
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-02")
+                  .put("dim1", "US")
+                  .put("dim2", "ABCDEF")
+                  .put("m1", 8)
+                  .build(),
+      ImmutableMap.<String, Object>builder()
+                  .put("t", "2000-01-03")
+                  .put("dim1", "US")
+                  .put("dim2", "ABCDEF")
+                  .put("m1", 2)
+                  .build()
+  ).stream().map(TestDataBuilder::createRow).collect(Collectors.toList());
+
+  @Override
+  public void configureGuice(DruidInjectorBuilder builder)
+  {
+    super.configureGuice(builder);
+    builder.addModule(new ArrayOfDoublesSketchModule());
+  }
+
+  @Override
+  public SpecificSegmentsQuerySegmentWalker createQuerySegmentWalker(
+      final QueryRunnerFactoryConglomerate conglomerate,
+      final JoinableFactoryWrapper joinableFactory,
+      final Injector injector
+  ) throws IOException
+  {
+    ArrayOfDoublesSketchModule.registerSerde();
+
+    final QueryableIndex index = IndexBuilder.create()
+                                             
.tmpDir(temporaryFolder.newFolder())
+                                             .segmentWriteOutMediumFactory(
+                                                 
OffHeapMemorySegmentWriteOutMediumFactory.instance()
+                                             )
+                                             .schema(
+                                                 new 
IncrementalIndexSchema.Builder()
+                                                     .withMetrics(
+                                                         new 
CountAggregatorFactory("cnt"),
+                                                         new 
ArrayOfDoublesSketchAggregatorFactory(
+                                                             
"tuplesketch_dim2",
+                                                             "dim2",
+                                                             null,
+                                                             
ImmutableList.of("m1"),
+                                                             1
+                                                         ),
+                                                         new 
LongSumAggregatorFactory("m1", "m1")
+                                                     )
+                                                     .withRollup(false)
+                                                     .build()
+                                             )
+                                             .rows(ROWS)
+                                             .buildMMappedIndex();
+
+    return new SpecificSegmentsQuerySegmentWalker(conglomerate).add(
+        DataSegment.builder()
+                   .dataSource(DATA_SOURCE)
+                   .interval(index.getDataInterval())
+                   .version("1")
+                   .shardSpec(new LinearShardSpec(0))
+                   .size(0)
+                   .build(),
+        index
+    );
+  }
+
+  @Test
+  public void testMetricsSumEstimate()
+  {
+    cannotVectorize();
+
+    final String sql = "SELECT\n"
+                   + "  dim1,\n"
+                   + "  SUM(cnt),\n"
+                   + "  
ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE(ARRAY_OF_DOUBLES_SKETCH(tuplesketch_dim2)),\n"
+                   + "  
ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE(ARRAY_OF_DOUBLES_SKETCH(dim2, 
m1)),\n"
+                   + "  
ARRAY_OF_DOUBLES_SKETCH_METRICS_SUM_ESTIMATE(ARRAY_OF_DOUBLES_SKETCH(dim2, m1, 
256))\n"

Review Comment:
   From a usability perspective, this is somewhat un-SQL. Having to compose two 
functions, with very long names, to get one result is unusual. Since SQL is 
often hand-written, SQL goes for convenience rather than breaking things down 
into their component parts. The above approach would be fine in Java, but is 
awkward in SQL.



##########
docs/querying/sql-functions.md:
##########
@@ -136,6 +136,42 @@ Returns an array of all values of the specified expression.
 
 Concatenates array inputs into a single array.
 
+## ARRAY_OF_DOUBLES_SKETCH
+
+`ARRAY_OF_DOUBLES_SKETCH(expr, [nominalEntries])`
+
+`ARRAY_OF_DOUBLES_SKETCH(dimensionColumnExpr, metricColumnExpr, ..., 
[nominalEntries])`
+
+**Function type:** [Aggregation](sql-aggregations.md)
+
+Creates a Tuple sketch.
+
+## ARRAY_OF_DOUBLES_SKETCH_INTERSECT
+
+`ARRAY_OF_DOUBLES_SKETCH_INTERSECT(expr, ..., [nominalEntries])`
+
+**Function type:** [Scalar, sketch](sql-scalar.md#tuple-sketch-functions)
+
+## ARRAY_OF_DOUBLES_SKETCH_NOT

Review Comment:
   A description of each function would be useful. I have to confess, I'm 
struggling to understand how one takes the logical `NOT` of a double, an array 
of doubles, or a sum of an array of doubles.



##########
extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/tuple/sql/ArrayOfDoublesSketchSqlAggregator.java:
##########
@@ -0,0 +1,191 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.datasketches.tuple.sql;
+
+import org.apache.calcite.rel.core.AggregateCall;
+import org.apache.calcite.rel.core.Project;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexLiteral;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.sql.SqlAggFunction;
+import org.apache.calcite.sql.SqlFunctionCategory;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.sql.type.InferTypes;
+import org.apache.calcite.sql.type.OperandTypes;
+import org.apache.calcite.util.Optionality;
+import org.apache.datasketches.Util;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.query.aggregation.AggregatorFactory;
+import 
org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchAggregatorFactory;
+import org.apache.druid.query.dimension.DefaultDimensionSpec;
+import org.apache.druid.query.dimension.DimensionSpec;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.column.RowSignature;
+import org.apache.druid.segment.column.ValueType;
+import org.apache.druid.sql.calcite.aggregation.Aggregation;
+import org.apache.druid.sql.calcite.aggregation.SqlAggregator;
+import org.apache.druid.sql.calcite.expression.DruidExpression;
+import org.apache.druid.sql.calcite.expression.Expressions;
+import org.apache.druid.sql.calcite.planner.Calcites;
+import org.apache.druid.sql.calcite.planner.PlannerContext;
+import org.apache.druid.sql.calcite.rel.VirtualColumnRegistry;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+public class ArrayOfDoublesSketchSqlAggregator implements SqlAggregator
+{
+
+  private static final SqlAggFunction FUNCTION_INSTANCE = new 
ArrayOfDoublesSqlAggFunction();
+  private static final String NAME = "ARRAY_OF_DOUBLES_SKETCH";
+
+  @Override
+  public SqlAggFunction calciteFunction()
+  {
+    return FUNCTION_INSTANCE;
+  }
+
+  @Nullable
+  @Override
+  public Aggregation toDruidAggregation(
+      PlannerContext plannerContext,
+      RowSignature rowSignature,
+      VirtualColumnRegistry virtualColumnRegistry,
+      RexBuilder rexBuilder,
+      String name,
+      AggregateCall aggregateCall,
+      Project project,
+      List<Aggregation> existingAggregations,
+      boolean finalizeAggregations
+  )
+  {
+
+    final List<Integer> argList = aggregateCall.getArgList();
+
+    // check last argument for nomimalEntries
+    final int nominalEntries;
+    final int metricExpressionEndIndex;
+    final int lastArgIndex = argList.size() - 1;
+    final RexNode potentialNominalEntriesArg = Expressions.fromFieldAccess(
+        rowSignature,
+        project,
+        argList.get(lastArgIndex)
+    );
+
+    if (potentialNominalEntriesArg.isA(SqlKind.LITERAL) &&
+        RexLiteral.value(potentialNominalEntriesArg) instanceof Number) {
+
+      nominalEntries = ((Number) 
RexLiteral.value(potentialNominalEntriesArg)).intValue();
+      metricExpressionEndIndex = lastArgIndex - 1;
+    } else {
+      nominalEntries = Util.DEFAULT_NOMINAL_ENTRIES;
+      metricExpressionEndIndex = lastArgIndex;
+    }
+
+    final List<String> fieldNames = new ArrayList<>();
+    for (int i = 0; i <= metricExpressionEndIndex; i++) {
+      final String fieldName;
+
+      final RexNode columnRexNode = Expressions.fromFieldAccess(
+          rowSignature,
+          project,
+          argList.get(i)
+      );

Review Comment:
   Druid likes to merge master rather than rebase. A rebase looses folks 
comment status, which folks kind of dislike.
   
   If your code conflicts with newer code, then, yes, please do merge master 
and resolve any conflicts.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] paul-rogers commented on a diff in pull request #13887: Tuple sketch SQL support

Reply via email to