jihoonson commented on a change in pull request #11379:
URL: https://github.com/apache/druid/pull/11379#discussion_r662744241



##########
File path: processing/src/main/java/org/apache/druid/query/groupby/orderby/DefaultLimitSpec.java
##########
@@ -220,6 +221,15 @@ public boolean isLimited()
      sortingNeeded = !query.getGranularity().equals(Granularities.ALL) && query.getContextSortByDimsFirst();
     }
 
+    if (!sortingNeeded) {
+      Map<String, Object> timestampFieldContext = GroupByQueryHelper.findTimestampResultField(query);

Review comment:
       Why not find the index from the query context?

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/strategy/GroupByStrategyV2.java
##########
@@ -213,7 +216,48 @@ public boolean doMergeResults(final GroupByQuery query)
     context.put("finalize", false);
     context.put(GroupByQueryConfig.CTX_KEY_STRATEGY, GroupByStrategySelector.STRATEGY_V2);
     context.put(CTX_KEY_OUTERMOST, false);
-    if (query.getUniversalTimestamp() != null) {
+    Map<String, Object> timestampFieldContext = GroupByQueryHelper.findTimestampResultField(query);
+    context.putAll(timestampFieldContext);
+
+    Granularity granularity = query.getGranularity();
+    List<DimensionSpec> dimensionSpecs = query.getDimensions();
+    final String timestampResultField = (String) timestampFieldContext.get(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD);
+    final boolean hasTimestampResultField = timestampResultField != null
+                                            && query.getContextBoolean(CTX_KEY_OUTERMOST, true);
+    int timestampResultFieldIndex = 0;
+    if (hasTimestampResultField) {
+      // for SQL like "group by city_id, time_floor(__time to day)",
+      // the original translated query has granularity=all and dimensions [d0, d1];
+      // the better plan is granularity=day and dimensions [d0],
+      // but the ResultRow structure is changed from [d0, d1] to [__time, d0].
+      // This structure should be fixed back to [d0, d1] (actually [d0, __time]) before postAggs are called
+      final Granularity timestampResultFieldGranularity
+          = (Granularity) timestampFieldContext.get(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD_GRANULARITY);
+      dimensionSpecs =
+          query.getDimensions()
+               .stream()
+               .filter(dimensionSpec -> !dimensionSpec.getOutputName().equals(timestampResultField))
+               .collect(Collectors.toList());
+      granularity = timestampResultFieldGranularity;
+      // when timestampResultField is the last dimension, we should set sortByDimsFirst=true,
+      // otherwise the downstream is sorted by each row's timestamp first, which makes the final ordering not as expected
+      timestampResultFieldIndex = (int) timestampFieldContext.get(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD_INDEX);
+      if (!query.getContextSortByDimsFirst() && timestampResultFieldIndex == query.getDimensions().size() - 1) {
+        context.put(GroupByQuery.CTX_KEY_SORT_BY_DIMS_FIRST, true);
+      }
+      // when timestampResultField is the first dimension and sortByDimsFirst=true,
+      // it is effectively equivalent to sortByDimsFirst=false
+      if (query.getContextSortByDimsFirst() && timestampResultFieldIndex == 0) {
+        context.put(GroupByQuery.CTX_KEY_SORT_BY_DIMS_FIRST, false);
+      }
+      // when hasTimestampResultField=true and timestampResultField is neither the first nor the last dimension,
+      // the DefaultLimitSpec will always do the reordering
+    }

Review comment:
       The Druid native query is the execution plan of the SQL query. To me, it makes more sense to compute the granularity, dimensions, sortByDimsFirst, etc., and set them on the groupBy query in the SQL planner, because the query engine should process the given query plan as-is. That way, we have a single place that optimizes the query plan, which is easier to maintain and improve. If multiple different places each modify the given query separately, it becomes very hard to track which query will actually be executed.
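       The rewrite in the diff can be condensed into two pure functions, which is roughly the shape it would take if moved into the planner as suggested. This is a standalone sketch; the class and method names are mine and purely illustrative, not part of the PR:

```java
import java.util.List;
import java.util.stream.Collectors;

public class TimestampFieldRewriteSketch {
    // Mirrors the stream filter in the diff: drop the timestamp result field
    // from the dimension list (dimensions represented as plain names here).
    static List<String> dropTimestampDimension(List<String> dims, String timestampResultField) {
        return dims.stream()
                   .filter(d -> !d.equals(timestampResultField))
                   .collect(Collectors.toList());
    }

    // Condenses the two if-blocks that flip sortByDimsFirst so the merge
    // order still matches the original [d0, d1] row ordering.
    static boolean effectiveSortByDimsFirst(boolean sortByDimsFirst, int tsIndex, int dimCount) {
        if (!sortByDimsFirst && tsIndex == dimCount - 1) {
            return true;   // timestamp is the last dimension: sort dims first
        }
        if (sortByDimsFirst && tsIndex == 0) {
            return false;  // timestamp is the first dimension: equivalent to timestamp-first
        }
        return sortByDimsFirst;
    }

    public static void main(String[] args) {
        System.out.println(dropTimestampDimension(List.of("d0", "d1"), "d1")); // [d0]
        System.out.println(effectiveSortByDimsFirst(false, 1, 2)); // true
        System.out.println(effectiveSortByDimsFirst(true, 0, 2));  // false
    }
}
```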

##########
File path: processing/src/main/java/org/apache/druid/query/groupby/orderby/DefaultLimitSpec.java
##########
@@ -220,6 +221,15 @@ public boolean isLimited()
      sortingNeeded = !query.getGranularity().equals(Granularities.ALL) && query.getContextSortByDimsFirst();
     }
 
+    if (!sortingNeeded) {
+      Map<String, Object> timestampFieldContext = GroupByQueryHelper.findTimestampResultField(query);
+      if (!timestampFieldContext.isEmpty()) {
+        int timestampResultFieldIndex = (int) timestampFieldContext.get(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD_INDEX);
+        // change the sorting order when timestampResultField is neither the first nor the last dimension
+        sortingNeeded = timestampResultFieldIndex != 0 && timestampResultFieldIndex != query.getDimensions().size() - 1;

Review comment:
       Should this depend on `query.getContextSortByDimsFirst()`? Such as:

   ```
           sortingNeeded = query.getContextSortByDimsFirst()
                           ? timestampResultFieldIndex != query.getDimensions().size() - 1
                           : timestampResultFieldIndex != 0;
   ```
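       To illustrate the difference, here is the suggested condition as a standalone function with a few sample inputs; the class and method names are mine, for illustration only:

```java
public class SortingNeededSketch {
    // sortByDimsFirst: whether dimensions sort before the timestamp;
    // tsIndex: position of the timestamp result field among the dimensions;
    // dimCount: total number of dimensions.
    static boolean sortingNeeded(boolean sortByDimsFirst, int tsIndex, int dimCount) {
        return sortByDimsFirst
               ? tsIndex != dimCount - 1   // dims-first order is correct only if timestamp is last
               : tsIndex != 0;             // timestamp-first order is correct only if timestamp is first
    }

    public static void main(String[] args) {
        // timestamp field first, rows already sorted timestamp-first: no re-sort
        System.out.println(sortingNeeded(false, 0, 3)); // false
        // timestamp field in the middle: a re-sort is required either way
        System.out.println(sortingNeeded(false, 1, 3)); // true
        System.out.println(sortingNeeded(true, 1, 3));  // true
        // timestamp field last with sortByDimsFirst=true: no re-sort
        System.out.println(sortingNeeded(true, 2, 3));  // false
    }
}
```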




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


