[GitHub] [druid] clintropolis commented on a change in pull request #12078: Grouping on arrays as arrays

GitBox Wed, 22 Dec 2021 18:45:44 -0800


clintropolis commented on a change in pull request #12078:
URL: https://github.com/apache/druid/pull/12078#discussion_r773583393




##########
File path: 
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/column/ArrayGroupByColumnSelectorStrategy.java
##########
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.groupby.epinephelinae.column;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.groupby.ResultRow;
+import org.apache.druid.query.groupby.epinephelinae.Grouper;
+import org.apache.druid.query.ordering.StringComparator;
+import org.apache.druid.query.ordering.StringComparators;
+import org.apache.druid.segment.ColumnValueSelector;
+import org.apache.druid.segment.DimensionSelector;
+import org.apache.druid.segment.data.ComparableIntArray;
+import org.apache.druid.segment.data.ComparableStringArray;
+import org.apache.druid.segment.data.IndexedInts;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+public class ArrayGroupByColumnSelectorStrategy

Review comment:
       this class only handles strings, so its name is perhaps incorrect, or it 
should be made more generic to handle other types of arrays

##########
File path: 
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/column/ArrayGroupByColumnSelectorStrategy.java
##########
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.groupby.epinephelinae.column;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.groupby.ResultRow;
+import org.apache.druid.query.groupby.epinephelinae.Grouper;
+import org.apache.druid.query.ordering.StringComparator;
+import org.apache.druid.query.ordering.StringComparators;
+import org.apache.druid.segment.ColumnValueSelector;
+import org.apache.druid.segment.DimensionSelector;
+import org.apache.druid.segment.data.ComparableIntArray;
+import org.apache.druid.segment.data.ComparableStringArray;
+import org.apache.druid.segment.data.IndexedInts;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+public class ArrayGroupByColumnSelectorStrategy
+    implements GroupByColumnSelectorStrategy
+{
+  private static final int GROUP_BY_MISSING_VALUE = -1;
+
+
+  // contains string <-> id for each element of the multi value grouping column
+  // for eg : [a,b,c] is the col value. dictionaryToInt will contain { a <-> 
1, b <-> 2, c <-> 3}
+  private final BiMap<String, Integer> dictionaryToInt;
+
+  // stores each row as a integer array where the int represents the value in 
dictionaryToInt
+  // for eg : [a,b,c] would be converted to [1,2,3] and assigned a integer 
value 1.
+  // [1,2,3] <-> 1
+  private final BiMap<ComparableIntArray, Integer> intListToInt;
+
+  @Override
+  public int getGroupingKeySize()
+  {
+    return Integer.BYTES;
+  }
+
+  public ArrayGroupByColumnSelectorStrategy()
+  {
+    dictionaryToInt = HashBiMap.create();
+    intListToInt = HashBiMap.create();
+  }
+
+  @VisibleForTesting
+  ArrayGroupByColumnSelectorStrategy(
+      BiMap<String, Integer> dictionaryToInt,
+      BiMap<ComparableIntArray, Integer> intArrayToInt
+  )
+  {
+    this.dictionaryToInt = dictionaryToInt;
+    this.intListToInt = intArrayToInt;
+  }
+
+  @Override
+  public void processValueFromGroupingKey(
+      GroupByColumnSelectorPlus selectorPlus,
+      ByteBuffer key,
+      ResultRow resultRow,
+      int keyBufferPosition
+  )
+  {
+    final int id = key.getInt(keyBufferPosition);
+
+    // GROUP_BY_MISSING_VALUE is used to indicate empty rows
+    if (id != GROUP_BY_MISSING_VALUE) {
+      final int[] intRepresentation = intListToInt.inverse()
+                                                  .get(id).getDelegate();
+      final String[] stringRepresentaion = new 
String[intRepresentation.length];
+      for (int i = 0; i < intRepresentation.length; i++) {
+        stringRepresentaion[i] = 
dictionaryToInt.inverse().get(intRepresentation[i]);
+      }
+      resultRow.set(selectorPlus.getResultRowPosition(), 
ComparableStringArray.of(stringRepresentaion));
+    } else {
+      resultRow.set(selectorPlus.getResultRowPosition(), 
ComparableStringArray.of(NullHandling.defaultStringValues()));

Review comment:
       why wouldn't we use null for null? this seems incorrect to use an empty 
array, and comparators down the line seem to use nulls first so having nulls 
seems like it would be ok. am i missing something?

##########
File path: 
processing/src/main/java/org/apache/druid/query/groupby/ResultRowDeserializer.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.groupby;
+
+import com.fasterxml.jackson.core.JsonParser;
+import com.fasterxml.jackson.core.JsonToken;
+import com.fasterxml.jackson.databind.DeserializationContext;
+import com.fasterxml.jackson.databind.JsonDeserializer;
+import org.apache.druid.data.input.Row;
+import org.apache.druid.java.util.common.ISE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.column.RowSignature;
+import org.apache.druid.segment.data.ComparableStringArray;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+
+
+public class ResultRowDeserializer extends JsonDeserializer<ResultRow>
+{
+  final List<ColumnType> types;
+  final GroupByQuery query;
+
+  public ResultRowDeserializer(final List<ColumnType> types, final 
GroupByQuery query)
+  {
+    this.types = types;
+    this.query = query;
+  }
+
+  public static ResultRowDeserializer fromQuery(
+      final GroupByQuery query
+  )
+  {
+    RowSignature rowSignature = query.getResultRowSignature();
+    final List<ColumnType> types = new ArrayList<>(rowSignature.size());
+
+    for (String name : rowSignature.getColumnNames()) {
+      final ColumnType type = rowSignature.getColumnType(name)
+                                          .orElseThrow(() -> new ISE("No type 
for column [%s]", name));
+
+      types.add(type);
+    }
+
+    return new ResultRowDeserializer(types, query);
+
+  }
+
+  @Override
+  public ResultRow deserialize(JsonParser jp, DeserializationContext ctxt) 
throws IOException
+  {
+    // Deserializer that can deserialize either array- or map-based rows.
+    if (jp.isExpectedStartObjectToken()) {
+      final Row row = jp.readValueAs(Row.class);
+      return ResultRow.fromLegacyRow(row, query);
+    } else if (jp.isExpectedStartArrayToken()) {
+      final Object[] retVal = new Object[types.size()];
+
+      for (int i = 0; i < types.size(); i++) {
+        final JsonToken token = jp.nextToken();
+        switch (types.get(i).getType()) {
+          case STRING:
+            if (token == JsonToken.VALUE_NULL) {
+              retVal[i] = null;
+            } else if (token == JsonToken.VALUE_STRING) {
+              retVal[i] = jp.getText();
+            } else {
+              throw ctxt.instantiationException(
+                  ResultRow.class,
+                  StringUtils.format("Unexpected token [%s] when reading 
string", token)
+              );
+            }
+            break;
+
+          case LONG:
+            retVal[i] = token == JsonToken.VALUE_NULL ? null : 
jp.getLongValue();
+            break;
+          case DOUBLE:
+            retVal[i] = token == JsonToken.VALUE_NULL ? null : 
jp.getDoubleValue();
+            break;
+          case FLOAT:
+            retVal[i] = token == JsonToken.VALUE_NULL ? null : 
jp.getFloatValue();
+            break;
+          case ARRAY:
+            if (types.get(i).equals(ColumnType.STRING_ARRAY)) {
+              final List<String> strings = new ArrayList<>();
+              while (jp.nextToken() != JsonToken.END_ARRAY) {
+                strings.add(jp.getText());
+              }
+              retVal[i] = ComparableStringArray.of(strings.toArray(new 
String[0]));
+              break;
+            }
+
+          default:
+            throw new ISE("Can't handle type [%s]", 
types.get(i).asTypeString());

Review comment:
       I'm a bit worried about this method. On the one hand it is nice because 
it preserves the types exactly, but ... what about complex types? they often 
serialize as base64 strings, but could register potentially anything with the 
mapper. And what about other types of arrays, including arrays of arrays? 
Additionally, type is still allowed to be null if it is unknown afaik.
   
   I'm also worried if there might be any performance implications here, this 
change should be measured I think if we really want to do it, but I can't help 
but wonder if something down the line should just be wrapping arrays to perform 
comparisons on them, lazily only when they need comparisons, instead of trying 
to do it here, and whether or not deserializing directly into 
`ComparableStringArray` will trip up other things down the line expecting to 
run into `List`
   
   

##########
File path: 
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java
##########
@@ -1357,6 +1373,12 @@ private RowBasedKeySerdeHelper makeSerdeHelper(
     )
     {
       switch (valueType.getType()) {
+        case ARRAY:
+          return new ArrayRowBasedKeySerdeHelper(

Review comment:
       this doesn't look like it handles all arrays, though using the 
`TypeStrategy` added in #11888 would maybe give the necessary comparators to 
handle the other types of arrays, but `arrayDictionary` would need to be able 
to hold any type of array, not just strings.

##########
File path: 
sql/src/main/java/org/apache/druid/sql/calcite/expression/builtin/ArrayConstructorOperatorConversion.java
##########
@@ -45,6 +53,22 @@ public DruidExpression toDruidExpression(
       final RexNode rexNode
   )
   {
+    // Check if array needs to be unnested
+    if (plannerContext.getQueryContext()
+                      .getOrDefault(
+                          QueryContexts.ENABLE_UNNESTED_ARRAYS_KEY,
+                          QueryContexts.DEFAULT_ENABLE_UNNESTED_ARRAYS
+                      ).equals(Boolean.FALSE)) {
+      List<RexNode> nodes = ((RexCall) rexNode).getOperands();
+      Preconditions.checkArgument(
+          nodes == null || nodes.size() != 1,
+          "ARRAY[] should have exactly one argument"
+      );
+      if (nodes.get(0).getKind() == SqlKind.LITERAL) {
+        throw new UOE("ARRAY[] support for literals not implemented");
+      }
+      return Expressions.toDruidExpression(plannerContext, rowSignature, 
nodes.get(0));
+    }

Review comment:
       I don't think this needs a flag or any code change here if we allow 
arrays in SQL to remain arrays to the native layer (see other comment on 
consolidating `Calcites.getColumnTypeForRelDataType` and 
`Calcites.getValueTypeForRelDataTypeFull`)
   
   calling the array constructor on a multi-value column today doesn't work 
correctly anyway, because it actually attempts to map the array constructor 
function across each of the values of the multi-value string resulting in an a 
nested array of single element string arrays, instead of simply converting the 
multi-value string into a string array. So, changing that behavior doesn't seem 
problematic since I'm not sure the mapping is very intuitive.
   
   Additionally, i think if this flag did exist, and this flag was set, it 
would break all other uses of the array constructor function, which _do_ work 
today.

##########
File path: 
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/GroupByQueryEngineV2.java
##########
@@ -386,13 +393,16 @@ public static boolean 
canPushDownLimit(ColumnSelectorFactory columnSelectorFacto
     @Override
     public GroupByColumnSelectorStrategy makeColumnSelectorStrategy(
         ColumnCapabilities capabilities,
-        ColumnValueSelector selector
+        ColumnValueSelector selector,
+        DimensionSpec dimensionSpec
     )
     {
       switch (capabilities.getType()) {
         case STRING:

Review comment:
       so, this PR doesn't _actually_ handle grouping on array typed columns 
yet, such as produced by virtual columns today with array functions, which is 
sort of a bummer (and part of why I'm pushing for having a more generic 
implementation of the grouping strategy in the first pass in a different 
comment). If you do plan to not do this yet, the PR title and description 
should be modified to indicate that it allows grouping on multi-value string 
'STRING' typed columns as arrays or similar

##########
File path: 
processing/src/main/java/org/apache/druid/query/dimension/ColumnSelectorStrategyFactory.java
##########
@@ -24,5 +24,9 @@
 
 public interface ColumnSelectorStrategyFactory<ColumnSelectorStrategyClass 
extends ColumnSelectorStrategy>
 {
-  ColumnSelectorStrategyClass makeColumnSelectorStrategy(ColumnCapabilities 
capabilities, ColumnValueSelector selector);
+  ColumnSelectorStrategyClass makeColumnSelectorStrategy(
+      ColumnCapabilities capabilities,
+      ColumnValueSelector selector,
+      DimensionSpec dimensionSpec

Review comment:
       it seems sort of strange to pass a `DimensionSpec` into this method, 
maybe instead the caller should modify the `ColumnCapabilities` instead of 
pushing this in?

##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java
##########
@@ -403,7 +404,17 @@ private static Grouping computeGrouping(
       }
 
       final RelDataType dataType = rexNode.getType();
-      final ColumnType outputType = 
Calcites.getColumnTypeForRelDataType(dataType);
+      final ColumnType outputType;
+      if (plannerContext.getQueryContext()
+                        .getOrDefault(
+                            QueryContexts.ENABLE_UNNESTED_ARRAYS_KEY,
+                            QueryContexts.DEFAULT_ENABLE_UNNESTED_ARRAYS
+                        ).equals(Boolean.FALSE)) {
+        outputType = Calcites.getValueTypeForRelDataTypeFull(dataType);
+      } else {
+        outputType = Calcites.getColumnTypeForRelDataType(dataType);
+      }

Review comment:
       if grouping on arrays supported _all_ array types, the correct thing to 
do instead of a flag I think would to just be consolidate 
`getColumnTypeForRelDataType` and `getValueTypeForRelDataTypeFull` to remove 
the string coercion and let arrays stay as arrays. They are separate basically 
because we allow using array functions on string typed columns, but grouping on 
them requires they remain as string typed in the native layer, and if that were 
no longer true we wouldn't have to do this coercion any longer.

##########
File path: 
processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java
##########
@@ -376,11 +373,37 @@ public static Float convertObjectToFloat(@Nullable Object 
valObj, boolean report
         return convertObjectToDouble(obj, reportParseExceptions);
       case STRING:
         return convertObjectToString(obj);
+      case ARRAY:
+        return convertToArray(obj);

Review comment:
       Does this change mean that anything that deals with arrays that go 
through here would need to know how to deal with `ComparableStringArray` 
instead of the actual arrays? Additionally, this method only seems to handle 
string arrays, it should either check that the element type is string and fail 
otherwise, or it should be a more generic type that can handle comparison of 
any type of array

##########
File path: 
processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryQueryToolChestTest.java
##########
@@ -573,7 +572,7 @@ public void testResultSerde() throws Exception
         .setQuerySegmentSpec(QueryRunnerTestHelper.FIRST_TO_THIRD)
         
.setDimensions(Collections.singletonList(DefaultDimensionSpec.of("test")))
         
.setAggregatorSpecs(Collections.singletonList(QueryRunnerTestHelper.ROWS_COUNT))
-        .setPostAggregatorSpecs(Collections.singletonList(new 
ConstantPostAggregator("post", 10)))
+        .setPostAggregatorSpecs(Collections.singletonList(new 
ConstantPostAggregator("post", 10.0)))

Review comment:
       i assume these changes are from the serializer change?

##########
File path: 
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/column/ArrayGroupByColumnSelectorStrategy.java
##########
@@ -0,0 +1,224 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.groupby.epinephelinae.column;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.groupby.ResultRow;
+import org.apache.druid.query.groupby.epinephelinae.Grouper;
+import org.apache.druid.query.ordering.StringComparator;
+import org.apache.druid.query.ordering.StringComparators;
+import org.apache.druid.segment.ColumnValueSelector;
+import org.apache.druid.segment.DimensionSelector;
+import org.apache.druid.segment.data.ComparableIntArray;
+import org.apache.druid.segment.data.ComparableStringArray;
+import org.apache.druid.segment.data.IndexedInts;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+public class ArrayGroupByColumnSelectorStrategy
+    implements GroupByColumnSelectorStrategy
+{
+  private static final int GROUP_BY_MISSING_VALUE = -1;
+
+
+  // contains string <-> id for each element of the multi value grouping column
+  // for eg : [a,b,c] is the col value. dictionaryToInt will contain { a <-> 
1, b <-> 2, c <-> 3}
+  private final BiMap<String, Integer> dictionaryToInt;

Review comment:
       i wonder if maybe this a premature optimization? I think we should 
consider making a more generic implementation first, that just stores a lookup 
of int to full arrays, instead of arrays of ints, and preferably one that can 
handle _all_ types of arrays. With that in hand, it would be a lot easier to 
measure various other optimizations against that baseline.
   
   Not to say that this isn't the best approach to handling arrays of strings, 
it might very well be. But, I think there is also potentially a lot of overhead 
in this approach, since you'll need to be doing extra function calls, and saves 
the most memory footprint when there is low value cardinality, and to the 
largest extent only for arrays of variably sized types like strings. If all 
string values are unique then it is potentially worse than just storing the 
string arrays directly.
   
   we also don't actually seem to be taking advantage of the what I would think 
would be the real situation to use this approach, which is in the cases where 
an underlying multi-value string column is being used that is already 
dictionary encoded, we could use it's dictionary ids directly instead of making 
a new lookup here. In that special case, where grouping on a multi-value string 
column from a segment as an array, I can maybe see it being advantageous to 
just convert the `IndexedInts` to an array of ints directly, so you can defer 
ever looking up the string value for the dictionary id until it is needed. But, 
that _only_ works when the underlying type is dictionary encoded, which isn't 
always the case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] clintropolis commented on a change in pull request #12078: Grouping on arrays as arrays

Reply via email to