clintropolis commented on code in PR #16620:
URL: https://github.com/apache/druid/pull/16620#discussion_r1663839764
##########
processing/src/main/java/org/apache/druid/query/datasourcemetadata/DataSourceQueryQueryToolChest.java:
##########
@@ -119,4 +121,10 @@ public CacheStrategy getCacheStrategy(DataSourceMetadataQuery query)
{
return null;
}
+
+ @Override
+ public CacheStrategy getCacheStrategy(DataSourceMetadataQuery query, @Nullable ObjectMapper mapper)
+ {
+ return null;
Review Comment:
nit: if this just returns null, is the override really needed?
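To illustrate the reviewer's point: if the base class supplies a default implementation that already returns null (meaning "not cacheable"), subclasses with no cache strategy can simply omit the override. This is a minimal sketch with hypothetical names, not Druid's actual `QueryToolChest` hierarchy:

```java
// Hypothetical sketch: a base class default that returns null makes the
// null-returning override in the subclass redundant.
abstract class ToolChestSketch<Q> {
    // Default: no caching. A null return means "this query type is not cacheable".
    public Object getCacheStrategy(Q query, Object mapper) {
        return null;
    }
}

class MetadataToolChestSketch extends ToolChestSketch<String> {
    // No override needed; the inherited default already returns null.
}
```

Callers see the same null either way, so deleting the override changes nothing observable.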
##########
processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryQueryToolChest.java:
##########
@@ -861,4 +856,27 @@ private static BitSet extractionsToRewrite(GroupByQuery query)
return retVal;
}
+
+ private static Class<?>[] createDimensionClasses(final GroupByQuery query)
+ {
+ final List<DimensionSpec> queryDimensions = query.getDimensions();
+ final Class<?>[] classes = new Class[queryDimensions.size()];
+ for (int i = 0; i < queryDimensions.size(); ++i) {
+ final ColumnType dimensionOutputType = queryDimensions.get(i).getOutputType();
+ if (dimensionOutputType.is(ValueType.COMPLEX)) {
+ NullableTypeStrategy nullableTypeStrategy = dimensionOutputType.getNullableStrategy();
+ if (!nullableTypeStrategy.groupable()) {
+ throw DruidException.defensive(
+ "Ungroupable dimension [%s] with type [%s] found in the query.",
+ queryDimensions.get(i).getDimension(),
+ dimensionOutputType
+ );
+ }
+ classes[i] = nullableTypeStrategy.getClazz();
+ } else {
+ classes[i] = Object.class;
Review Comment:
any reason not to use the real class here?
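The reviewer's suggestion, sketched with hypothetical names (not Druid's actual `ValueType`/`ColumnType` API): resolve each dimension's concrete Java class from its value type instead of falling back to `Object.class`, so downstream deserialization gets the "real" class:

```java
// Illustrative only: a simplified value-type enum and a mapping to the
// concrete class each primitive type deserializes to. Object.class is kept
// only as a last resort.
enum ValueTypeSketch { STRING, LONG, DOUBLE, COMPLEX }

class DimensionClassesSketch {
    static Class<?> classFor(ValueTypeSketch type) {
        switch (type) {
            case STRING: return String.class;
            case LONG:   return Long.class;
            case DOUBLE: return Double.class;
            default:     return Object.class; // complex types would resolve via their type strategy
        }
    }
}
```

With the real class available, a JSON codec can target e.g. `Long.class` directly rather than producing a generic `Object` that needs a second conversion pass.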
##########
processing/src/main/java/org/apache/druid/query/QueryToolChest.java:
##########
@@ -251,19 +251,30 @@ public Function<ResultType, ResultType> makePostComputeManipulatorFn(QueryType q
*/
public abstract TypeReference<ResultType> getResultTypeReference();
+ /**
+ *
Review Comment:
nit: please add javadocs or remove (but preferably add, and document what null means if this can return null, like the other overload does).
Also, should this be deprecated?
##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java:
##########
@@ -1371,6 +1361,77 @@ public Grouper.BufferComparator bufferComparatorWithAggregators(
);
}
+ @Override
+ public ObjectMapper decorateObjectMapper(ObjectMapper spillMapper)
+ {
+
+ final JsonDeserializer<RowBasedKey> deserializer = new JsonDeserializer<RowBasedKey>()
+ {
+ @Override
+ public RowBasedKey deserialize(
+ JsonParser jp,
+ DeserializationContext deserializationContext
+ ) throws IOException
+ {
+ if (!jp.isExpectedStartArrayToken()) {
+ throw DruidException.defensive("Expected array start token, received [%s]", jp.getCurrentToken());
+ }
+ jp.nextToken();
+
+ final ObjectCodec codec = jp.getCodec();
+ final int timestampAdjustment = includeTimestamp ? 1 : 0;
+ final int dimsToRead = timestampAdjustment + serdeHelpers.length;
+ int dimsReadSoFar = 0;
+ final Object[] objects = new Object[dimsToRead];
+
+ while (jp.currentToken() != JsonToken.END_ARRAY) {
+ if (dimsReadSoFar >= dimsToRead) {
+ throw DruidException.defensive("More dimensions encountered than expected [%d]", dimsToRead);
+ }
+
+ if (includeTimestamp && dimsReadSoFar == 0) {
+ // Read the timestamp
+ objects[dimsReadSoFar] = codec.readValue(jp, Long.class);
+ } else {
Review Comment:
i guess my thought was that we would be removing one conditional check from the loop for every column of every row, which could add up, but i haven't really measured one way or another. This is a pretty hot loop though, so it's worth thinking about making it as efficient as possible.
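The hoisting the reviewer is describing can be sketched as follows. Names and the token-array stand-in are illustrative, not Druid's deserializer: the point is that reading the timestamp once before the loop removes the `includeTimestamp && dimsReadSoFar == 0` branch from the per-column hot path.

```java
// Illustrative sketch: handle the timestamp outside the loop so the loop
// body contains only the dimension read, with no per-iteration branch.
class HoistedReadSketch {
    // Simulates deserializing one row of already-parsed tokens.
    static Object[] readRow(Object[] tokens, boolean includeTimestamp) {
        final Object[] out = new Object[tokens.length];
        int i = 0;
        if (includeTimestamp) {
            out[0] = tokens[0]; // timestamp read once, before the loop
            i = 1;
        }
        for (; i < tokens.length; i++) {
            out[i] = tokens[i]; // hot loop: dimension reads only
        }
        return out;
    }
}
```

Whether the saved branch matters in practice would need measurement, as the comment itself notes; modern branch predictors make this kind of always-false check nearly free.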
##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java:
##########
@@ -1613,6 +1667,11 @@ public GenericRowBasedKeySerdeHelper(
dictionary.get(lhsBuffer.getInt(lhsPosition + keyBufferPosition)),
dictionary.get(rhsBuffer.getInt(rhsPosition + keyBufferPosition))
);
+ if (columnType.is(ValueType.COMPLEX)) {
+ clazz = columnType.getNullableStrategy().getClazz();
+ } else {
+ clazz = Object.class;
Review Comment:
why not always use the column type?
##########
processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryQueryToolChest.java:
##########
@@ -727,13 +715,20 @@ public ResultRow apply(Object input)
int dimPos = 0;
while (dimsIter.hasNext() && results.hasNext()) {
final DimensionSpec dimensionSpec = dimsIter.next();
+ final Object dimensionObject = results.next();
+ final Object dimensionObjectCasted;
// Must convert generic Jackson-deserialized type into the proper type.
- resultRow.set(
- dimensionStart + dimPos,
- DimensionHandlerUtils.convertObjectToType(results.next(), dimensionSpec.getOutputType())
- );
-
+ if (dimensionSpec.getOutputType().is(ValueType.COMPLEX)) {
Review Comment:
i kind of wonder if we should special-handle json (`COMPLEX<json>`) here, since it is kind of a special complex type (that really should probably be a built-in standard type instead of complex...), because `convertValue` looks pretty expensive.
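The special-casing idea can be sketched like this. Everything here is a hypothetical stand-in (the type-name string, the counter, the conversion): the point is a cheap structural check that lets already-correct `COMPLEX<json>` values bypass the expensive generic conversion path.

```java
import java.util.Map;

// Illustrative sketch: pass JSON values through when they already arrive as
// the expected structure (e.g. a Map), and only fall back to the costly
// generic conversion otherwise.
class ConvertSketch {
    static int expensiveConversions = 0; // instrumentation for the sketch only

    static Object convertDimension(Object value, String typeName) {
        if ("COMPLEX<json>".equals(typeName) && value instanceof Map) {
            return value; // already structurally correct; skip conversion
        }
        expensiveConversions++;
        return String.valueOf(value); // stand-in for the expensive generic path
    }
}
```

The fast path costs one string compare and one `instanceof`, which is far cheaper than a full Jackson `convertValue` round trip.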
##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java:
##########
@@ -1371,6 +1361,69 @@ public Grouper.BufferComparator bufferComparatorWithAggregators(
);
}
+ @Override
+ public ObjectMapper decorateObjectMapper(ObjectMapper spillMapper)
+ {
+
+ final JsonDeserializer<RowBasedKey> deserializer = new JsonDeserializer<RowBasedKey>()
+ {
+ @Override
+ public RowBasedKey deserialize(
+ JsonParser jp,
+ DeserializationContext deserializationContext
+ ) throws IOException
+ {
+ if (!jp.isExpectedStartArrayToken()) {
+ throw DruidException.defensive("Expected array start token, received [%s]", jp.getCurrentToken());
+ }
+ jp.nextToken();
+
+ final ObjectCodec codec = jp.getCodec();
+ final int timestampAdjustment = includeTimestamp ? 1 : 0;
+ final int dimsToRead = timestampAdjustment + serdeHelpers.length;
+ int dimsReadSoFar = 0;
+ final Object[] objects = new Object[dimsToRead];
+
+ while (jp.currentToken() != JsonToken.END_ARRAY) {
+ if (dimsReadSoFar >= dimsToRead) {
+ throw DruidException.defensive("More dimensions encountered than expected [%d]", dimsToRead);
+ }
+
+ if (includeTimestamp && dimsReadSoFar == 0) {
+ // Read the timestamp
+ objects[dimsReadSoFar] = codec.readValue(jp, Long.class);
+ } else {
+ DruidException.conditionalDefensive(
+ dimsReadSoFar - timestampAdjustment < serdeHelpers.length,
+ "Insufficient serde helpers present"
+ );
+ // Read the dimension
+ objects[dimsReadSoFar] = codec.readValue(jp, serdeHelpers[dimsReadSoFar - timestampAdjustment].getClazz());
+ }
+
+ ++dimsReadSoFar;
+ jp.nextToken();
+ }
+
+ return new RowBasedKey(objects);
+ }
+ };
+
+ class SpillModule extends SimpleModule
+ {
+ public SpillModule()
+ {
+ addDeserializer(RowBasedKey.class, deserializer);
+ }
+ }
+
+ final ObjectMapper newObjectMapper = spillMapper.copy();
+ newObjectMapper.registerModule(new SpillModule());
Review Comment:
should this be a static somewhere so we don't have to make a new one all the
time?
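The static-sharing suggestion can be sketched with the standard static-holder idiom. The `Object` stand-in below is hypothetical (the real field would hold the `SpillModule`, which carries no per-query state, so a single shared instance is safe):

```java
// Illustrative sketch: build the stateless module once via class
// initialization (thread-safe per the JLS) instead of per decorateObjectMapper call.
class SpillModuleHolderSketch {
    static int buildCount = 0; // sketch instrumentation; declared before SHARED_MODULE so its initializer runs first
    static final Object SHARED_MODULE = buildModule();

    private static Object buildModule() {
        buildCount++;
        return new Object(); // stand-in for `new SpillModule()`
    }

    static Object getModule() {
        return SHARED_MODULE;
    }
}
```

Note that the mapper `copy()` itself likely still has to happen per call if callers can pass different base mappers; only the module (and its deserializer) is a candidate for static reuse.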
##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java:
##########
@@ -1726,6 +1791,12 @@ public Object2IntMap<Object[]> getReverseDictionary()
{
return reverseDictionary;
}
+
+ @Override
+ public Class<?> getClazz()
+ {
+ return Object.class;
Review Comment:
shouldn't this be `Object[].class`?
##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java:
##########
@@ -1819,6 +1896,12 @@ public BufferComparator getBufferComparator()
{
return bufferComparator;
}
+
+ @Override
+ public Class<?> getClazz()
+ {
+ return Object.class;
Review Comment:
`String.class`?
##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java:
##########
@@ -1770,6 +1841,12 @@ public Object2IntMap<Object[]> getReverseDictionary()
{
return reverseStringArrayDictionary;
}
+
+ @Override
+ public Class<?> getClazz()
+ {
+ return Object.class;
Review Comment:
same thing `Object[].class`
##########
processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryQueryToolChestTest.java:
##########
@@ -97,6 +98,8 @@ public class GroupByQueryQueryToolChestTest extends InitializedNullHandlingTest
public static void setUpClass()
{
NullHandling.initializeForTests();
+ //noinspection ResultOfObjectAllocationIgnored
+ new AggregatorsModule();
Review Comment:
hmm, we should move the complex serde registrations of that module to a static method it calls instead of doing this, so it's more obvious what is going on here
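The refactor being suggested can be sketched as follows. The registry, method name, and serde names are hypothetical, not `AggregatorsModule`'s actual contents: the point is replacing a side-effecting constructor call with an explicitly named, idempotent static method that both the module and tests can call.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: registration lives in a named static method, so a test
// writes `SerdeRegistrySketch.registerComplexSerdes()` instead of the opaque
// `new AggregatorsModule()`.
class SerdeRegistrySketch {
    private static final Set<String> REGISTERED = new HashSet<>();

    // Idempotent: safe to call from the module constructor and from any test setup.
    static synchronized void registerComplexSerdes() {
        REGISTERED.add("hyperUnique");          // hypothetical serde name
        REGISTERED.add("approximateHistogram"); // hypothetical serde name
    }

    static boolean isRegistered(String name) {
        return REGISTERED.contains(name);
    }
}
```

Besides readability, this also removes the need for the `//noinspection ResultOfObjectAllocationIgnored` suppression in the test.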
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]