Re: [PR] Grouping on complex columns aka unifying GroupBy strategies (druid)

via GitHub Fri, 19 Apr 2024 03:26:17 -0700


cryptoe commented on code in PR #16068:
URL: https://github.com/apache/druid/pull/16068#discussion_r1572162264



##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/column/DictionaryBuildingGroupByColumnSelectorStrategy.java:
##########
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.groupby.epinephelinae.column;
+
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+import it.unimi.dsi.fastutil.objects.Object2IntMap;
+import org.apache.druid.error.DruidException;
+import org.apache.druid.query.groupby.epinephelinae.DictionaryBuildingUtils;
+import org.apache.druid.segment.column.ColumnType;
+import org.apache.druid.segment.column.NullableTypeStrategy;
+
+import javax.annotation.concurrent.NotThreadSafe;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Strategy for grouping dimensions which can have variable-width objects, and 
aren't backed by prebuilt dictionaries. It
+ * encapsulates the dictionary building logic, along with providing the 
implementations for dimension to dictionary id
+ * encoding-decoding.
+ * <p>
+ * This strategy can handle any dimension that can be addressed on a 
reverse-dictionary. Reverse dictionary uses
+ * a sorted map, rather than a hashmap.
+ * <p>
+ * This is the most expensive of all the strategies, and hence must be used 
only when other strategies aren't valid.
+ */
+@NotThreadSafe
+public class DictionaryBuildingGroupByColumnSelectorStrategy<DimensionType>
+    extends KeyMappingGroupByColumnSelectorStrategy<DimensionType>
+{
+
+  private DictionaryBuildingGroupByColumnSelectorStrategy(
+      DimensionIdCodec<DimensionType> dimensionIdCodec,
+      ColumnType columnType,
+      NullableTypeStrategy<DimensionType> nullableTypeStrategy,
+      DimensionType defaultValue
+  )
+  {
+    super(dimensionIdCodec, columnType, nullableTypeStrategy, defaultValue);
+  }
+
+  /**
+   * Creates an implementation of the strategy for the given type
+   */
+  public static GroupByColumnSelectorStrategy forType(final ColumnType 
columnType)
+  {
+    if (columnType.equals(ColumnType.STRING)) {
+      // String types are handled specially because they can have multi-value 
dimensions
+      throw DruidException.defensive("Should use special variant which handles 
multi-value dimensions");
+    } else if (
+      // Defensive check, primitives should be using a faster fixed-width 
strategy
+        columnType.equals(ColumnType.DOUBLE)
+        || columnType.equals(ColumnType.FLOAT)
+        || columnType.equals(ColumnType.LONG)) {
+      throw DruidException.defensive("Could used a fixed width strategy");
+    }
+
+    if (ColumnType.STRING_ARRAY.equals(columnType)) {
+      forStringArrays();
+    }
+
+    // Catch-all for all other types, that can only have single-valued 
dimensions
+    return forArrayAndComplexTypes(columnType);
+  }
+
+  /**
+   * Implemenatation of dictionary building strategy for types other than 
strings (since they can be multi-valued and need
+   * to be handled separately) and numeric primitives (since they can be 
handled by fixed-width strategy).
+   * This also means that we handle array and complex types here, which 
simplifies the generics a lot, as everything can be
+   * treated as Object in this class.
+   * <p>
+   * Also, there isn't any concept of multi-values here, therefore Dimension 
== DimensionHolderType == Object. We still
+   * homogenize rogue selectors which can return non-standard implementation 
of arrays (like Long[] for long arrays instead of
+   * Object[]) to what the callers would expect (i.e. Object[] in this case).
+   */
+  private static GroupByColumnSelectorStrategy forArrayAndComplexTypes(final 
ColumnType columnType)
+  {
+    final List<Object> dictionary = DictionaryBuildingUtils.createDictionary();
+    final Object2IntMap<Object> reverseDictionary =
+        
DictionaryBuildingUtils.createReverseDictionary(columnType.getNullableStrategy());
+    return new DictionaryBuildingGroupByColumnSelectorStrategy<>(
+        new UniValueDimensionIdCodec(dictionary, reverseDictionary, 
columnType.getNullableStrategy()),
+        columnType,
+        columnType.getNullableStrategy(),
+        null
+    );
+  }
+
+  private static GroupByColumnSelectorStrategy forStringArrays()
+  {
+    final BiMap<String, Integer> elementBiDictionary = HashBiMap.create();
+    final BiMap<ArrayList<Integer>, Integer> arrayBiDictionary = 
HashBiMap.create();
+    return new DictionaryBuildingGroupByColumnSelectorStrategy<>(
+        new StringArrayDimensionIdCodec(elementBiDictionary, 
arrayBiDictionary),
+        ColumnType.STRING_ARRAY,
+        ColumnType.STRING_ARRAY.getNullableStrategy(),
+        null
+    );
+  }
+
+  private static class UniValueDimensionIdCodec implements 
DimensionIdCodec<Object>
+  {
+    /**
+     * Dictionary for mapping the dimension value to an index. i-th position 
in the dictionary holds the value represented
+     * by the dictionaryId "i".
+     * Therefore, if a value has a dictionary id "i", dictionary.get(i) = value
+     * If a -> 0, b -> 1, c -> 2 (value -> dictionaryId), then the dictionary 
would be laid out like: [a, b, c]
+     */
+    private final List<Object> dictionary;
+
+    /**
+     * Reverse dictionary for faster lookup into the dictionary, and reusing 
pre-existing dictionary ids.
+     * <p>
+     * An entry of form (value, i) in the reverse dictionary represents that 
"value" is present at the i-th location in the
+     * {@link #dictionary}.
+     * Absence of mapping of a "value" (denoted by returning {@link 
GroupByColumnSelectorStrategy#GROUP_BY_MISSING_VALUE})
+     * represents that the value is absent in the dictionary
+     * If a -> 0, b -> 1, c -> 2 (value -> dictionaryId), then the reverse 
dictionary would have the entries (a, 0), (b, 1),
+     * (c, 2)
+     */
+    private final Object2IntMap<Object> reverseDictionary;
+
+    @SuppressWarnings("rawtypes")
+    private final NullableTypeStrategy nullableTypeStrategy;
+
+    public UniValueDimensionIdCodec(
+        final List<Object> dictionary,
+        final Object2IntMap<Object> reverseDictionary,
+        final NullableTypeStrategy nullableTypeStrategy
+    )
+    {
+      this.dictionary = dictionary;
+      this.reverseDictionary = reverseDictionary;

Review Comment:
   I feel the objects should be owned by the code since we are resetting them. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Grouping on complex columns aka unifying GroupBy strategies (druid)

Reply via email to