clintropolis commented on code in PR #16620:
URL: https://github.com/apache/druid/pull/16620#discussion_r1671675273


##########
processing/src/main/java/org/apache/druid/query/groupby/GroupByQueryQueryToolChest.java:
##########
@@ -727,13 +716,33 @@ public ResultRow apply(Object input)
             int dimPos = 0;
             while (dimsIter.hasNext() && results.hasNext()) {
               final DimensionSpec dimensionSpec = dimsIter.next();
-
-              // Must convert generic Jackson-deserialized type into the proper type.
-              resultRow.set(
-                  dimensionStart + dimPos,
-                  DimensionHandlerUtils.convertObjectToType(results.next(), dimensionSpec.getOutputType())
-              );
-
+              final Object dimensionObject = results.next();
+              final Object dimensionObjectCasted;
+
+              final ColumnType outputType = dimensionSpec.getOutputType();
+
+              // Must convert generic Jackson-deserialized type into the proper type. The downstream functions expect the
+              // dimensions to be of appropriate types for further processing like merging and comparing.
+              if (outputType.is(ValueType.COMPLEX)) {
+                // JSON columns can interpret generic data objects appropriately, hence they are wrapped as-is in StructuredData.
+                // They don't need to be converted from Object.class to StructuredData.class via the object mapper, as that is an
+                // expensive and wasteful operation.
+                if (outputType.equals(ColumnType.NESTED_DATA)) {
+                  dimensionObjectCasted = StructuredData.wrap(dimensionObject);
+                } else {
+                  DruidException.conditionalDefensive(
+                      mapper != null,
+                      "Cannot deserialize complex dimension if object mapper is not provided"
+                  );

Review Comment:
   nit: i wonder if this should be outside of the loop, or at least detected before we get here? like maybe while doing `createDimensionClasses` or something
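   A minimal, self-contained sketch of that suggestion, with hypothetical names standing in for the Druid classes (`conditionalDefensive` mimics `DruidException.conditionalDefensive`, and the string-typed dimension list is purely illustrative), hoisting the one-time mapper check out of the per-row loop:

   ```java
   import java.util.List;

   public class HoistCheckSketch
   {
     // Hypothetical stand-in for DruidException.conditionalDefensive.
     static void conditionalDefensive(boolean condition, String message)
     {
       if (!condition) {
         throw new IllegalStateException(message);
       }
     }

     // Sketch: validate the mapper once, up front (e.g. while creating the
     // dimension classes), instead of re-checking inside the per-row loop.
     public static int convertDimensions(List<String> outputTypes, Object mapper)
     {
       boolean anyComplex = outputTypes.stream().anyMatch("COMPLEX"::equals);
       // One-time defensive check, hoisted out of the hot loop:
       conditionalDefensive(
           !anyComplex || mapper != null,
           "Cannot deserialize complex dimension if object mapper is not provided"
       );

       int converted = 0;
       for (String type : outputTypes) {
         // ... per-dimension conversion would happen here, with no repeated null check ...
         converted++;
       }
       return converted;
     }
   }
   ```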



##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java:
##########
@@ -1371,6 +1361,69 @@ public Grouper.BufferComparator bufferComparatorWithAggregators(
       );
     }
 
+    @Override
+    public ObjectMapper decorateObjectMapper(ObjectMapper spillMapper)
+    {
+
+      final JsonDeserializer<RowBasedKey> deserializer = new JsonDeserializer<RowBasedKey>()
+      {
+        @Override
+        public RowBasedKey deserialize(
+            JsonParser jp,
+            DeserializationContext deserializationContext
+        ) throws IOException
+        {
+          if (!jp.isExpectedStartArrayToken()) {
+            throw DruidException.defensive("Expected array start token, received [%s]", jp.getCurrentToken());
+          }
+          jp.nextToken();
+
+          final ObjectCodec codec = jp.getCodec();
+          final int timestampAdjustment = includeTimestamp ? 1 : 0;
+          final int dimsToRead = timestampAdjustment + serdeHelpers.length;
+          int dimsReadSoFar = 0;
+          final Object[] objects = new Object[dimsToRead];
+
+          while (jp.currentToken() != JsonToken.END_ARRAY) {
+            if (dimsReadSoFar >= dimsToRead) {
+              throw DruidException.defensive("More dimensions encountered than expected [%d]", dimsToRead);
+            }
+
+            if (includeTimestamp && dimsReadSoFar == 0) {
+              // Read the timestamp
+              objects[dimsReadSoFar] = codec.readValue(jp, Long.class);
+            } else {
+              DruidException.conditionalDefensive(
+                  dimsReadSoFar - timestampAdjustment < serdeHelpers.length,
+                  "Insufficient serde helpers present"
+              );
+              // Read the dimension
+              objects[dimsReadSoFar] =
+                  codec.readValue(jp, serdeHelpers[dimsReadSoFar - timestampAdjustment].getClazz());
+            }
+
+            ++dimsReadSoFar;
+            jp.nextToken();
+          }
+
+          return new RowBasedKey(objects);
+        }
+      };
+
+      class SpillModule extends SimpleModule
+      {
+        public SpillModule()
+        {
+          addDeserializer(RowBasedKey.class, deserializer);
+        }
+      }
+
+      final ObjectMapper newObjectMapper = spillMapper.copy();
+      newObjectMapper.registerModule(new SpillModule());

Review Comment:
   oh right, i can't read 😜 , sorry
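   For reference, the shape of the array walk that deserializer performs can be sketched without Jackson; `readKey`, the pre-tokenized list, and `numSerdeHelpers` are hypothetical stand-ins for the parser, codec, and serde helpers in the real code:

   ```java
   import java.util.List;

   public class RowKeyWalkSketch
   {
     // Hypothetical, Jackson-free stand-in for the deserializer's array walk:
     // slot 0 optionally holds the timestamp, the remaining slots hold one
     // dimension per serde helper.
     public static Object[] readKey(List<Object> tokens, boolean includeTimestamp, int numSerdeHelpers)
     {
       final int timestampAdjustment = includeTimestamp ? 1 : 0;
       final int dimsToRead = timestampAdjustment + numSerdeHelpers;
       final Object[] objects = new Object[dimsToRead];
       int dimsReadSoFar = 0;

       for (Object token : tokens) {
         if (dimsReadSoFar >= dimsToRead) {
           throw new IllegalStateException("More dimensions encountered than expected [" + dimsToRead + "]");
         }
         if (includeTimestamp && dimsReadSoFar == 0) {
           // Timestamp slot: the real code asks the codec for a Long here.
           objects[dimsReadSoFar] = (Long) token;
         } else {
           // Dimension slot: the real code asks the codec for the class
           // reported by the matching serde helper's getClazz().
           objects[dimsReadSoFar] = token;
         }
         ++dimsReadSoFar;
       }
       return objects;
     }
   }
   ```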



##########
processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/RowBasedGrouperHelper.java:
##########
@@ -1726,6 +1791,12 @@ public Object2IntMap<Object[]> getReverseDictionary()
       {
         return reverseDictionary;
       }
+
+      @Override
+      public Class<?> getClazz()
+      {
+        return Object.class;

Review Comment:
   i guess i was wondering if maybe could help get rid of some of the coerce 
list to array stuff, but totally fine to explore this later
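   A hedged sketch of the coercion being discussed, assuming the usual Jackson behavior of mapping a JSON array to a `List` when the target is `Object.class`; `coerceToObjectArray` is a hypothetical helper, not Druid code:

   ```java
   import java.util.List;

   public class CoerceListSketch
   {
     // When the deserialization target is Object.class, a spilled JSON array
     // typically comes back as a List, which the array-based key then has to
     // coerce into Object[]. A more specific target class (e.g. Object[].class)
     // could let the mapper produce the array directly and skip this step.
     public static Object[] coerceToObjectArray(Object value)
     {
       if (value instanceof Object[]) {
         return (Object[]) value;             // already an array, nothing to do
       }
       if (value instanceof List) {
         return ((List<?>) value).toArray();  // the coercion this comment hopes to drop
       }
       throw new IllegalArgumentException("Expected an array-like value");
     }
   }
   ```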



##########
processing/src/test/java/org/apache/druid/query/groupby/GroupByQueryQueryToolChestTest.java:
##########
@@ -97,6 +98,8 @@ public class GroupByQueryQueryToolChestTest extends InitializedNullHandlingTest
   public static void setUpClass()
   {
     NullHandling.initializeForTests();
+    //noinspection ResultOfObjectAllocationIgnored
+    new AggregatorsModule();

Review Comment:
   nit: forget about this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

