[GitHub] [druid] kaijianding commented on a change in pull request #11379: improve groupBy query granularity translation with 2x query performance improve when issued from sql layer

GitBox Tue, 06 Jul 2021 04:05:33 -0700


kaijianding commented on a change in pull request #11379:
URL: https://github.com/apache/druid/pull/11379#discussion_r663808475




##########
File path: processing/src/main/java/org/apache/druid/query/groupby/strategy/GroupByStrategyV2.java
##########
@@ -213,7 +216,48 @@ public boolean doMergeResults(final GroupByQuery query)
     context.put("finalize", false);
     context.put(GroupByQueryConfig.CTX_KEY_STRATEGY, GroupByStrategySelector.STRATEGY_V2);
     context.put(CTX_KEY_OUTERMOST, false);
-    if (query.getUniversalTimestamp() != null) {
+    Map<String, Object> timestampFieldContext = GroupByQueryHelper.findTimestampResultField(query);
+    context.putAll(timestampFieldContext);
+
+    Granularity granularity = query.getGranularity();
+    List<DimensionSpec> dimensionSpecs = query.getDimensions();
+    final String timestampResultField = (String) timestampFieldContext.get(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD);
+    final boolean hasTimestampResultField = timestampResultField != null
+                                            && query.getContextBoolean(CTX_KEY_OUTERMOST, true);
+    int timestampResultFieldIndex = 0;
+    if (hasTimestampResultField) {
+      // sql like "group by city_id,time_floor(__time to day)",
+      // the original translated query is granularity=all and dimensions:[d0, d1]
+      // the better plan is granularity=day and dimensions:[d0]
+      // but the ResultRow structure is changed from [d0, d1] to [__time, d0]
+      // this structure should be fixed as [d0, d1] (actually it is [d0, __time]) before postAggs are called
+      final Granularity timestampResultFieldGranularity
+          = (Granularity) timestampFieldContext.get(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD_GRANULARITY);
+      dimensionSpecs =
+          query.getDimensions()
+               .stream()
+               .filter(dimensionSpec -> !dimensionSpec.getOutputName().equals(timestampResultField))
+               .collect(Collectors.toList());
+      granularity = timestampResultFieldGranularity;
+      // when timestampResultField is the last dimension, should set sortByDimsFirst=true,
+      // otherwise the downstream is sorted by row's timestamp first which makes the final ordering not as expected
+      timestampResultFieldIndex = (int) timestampFieldContext.get(GroupByQuery.CTX_TIMESTAMP_RESULT_FIELD_INDEX);
+      if (!query.getContextSortByDimsFirst() && timestampResultFieldIndex == query.getDimensions().size() - 1) {
+        context.put(GroupByQuery.CTX_KEY_SORT_BY_DIMS_FIRST, true);
+      }
+      // when timestampResultField is the first dimension and sortByDimsFirst=true,
+      // it is actually equals to sortByDimsFirst=false
+      if (query.getContextSortByDimsFirst() && timestampResultFieldIndex == 0) {
+        context.put(GroupByQuery.CTX_KEY_SORT_BY_DIMS_FIRST, false);
+      }
+      // when hasTimestampResultField=true and timestampResultField is neither first nor last dimension,
+      // the DefaultLimitSpec will always do the reordering
+    }

Review comment:
   OK, I have rolled back some of the code.
   Currently, timestampResultField is passed through the query context in `DruidQuery.toGroupByQuery`.
   granularity=all and sortByDimsFirst are still left unchanged in the query context, because some code, such as subtotals, relies on them. In other words, the basic idea is still to treat this optimization as an internal improvement to groupBy processing.
   Please check the latest version of the code.

Review comment:
   OK, I have rolled back some of the code.
   Currently, timestampResultField is passed through the query context in `DruidQuery.toGroupByQuery`.
   granularity=all and sortByDimsFirst are still left unchanged in `DruidQuery`, because some code, such as subtotals, relies on them. In other words, the basic idea is still to treat this optimization as an internal improvement to groupBy processing.
   Please check the latest version of the code.

Review comment:
   OK, I have rolled back some of the code.
   Currently, timestampResultField is passed through the query context in `DruidQuery.toGroupByQuery`.
   granularity=all and sortByDimsFirst are still left unchanged in `DruidQuery.toGroupByQuery`, because some code, such as subtotals, relies on them. In other words, the basic idea is still to treat this optimization as an internal improvement to groupBy processing.
   Please check the latest version of the code.
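
   For readers following the thread, here is a minimal sketch of the two adjustments the diff comments describe: putting the row timestamp back at the dimension position it was removed from, and adjusting sortByDimsFirst when the timestamp field is the first or last dimension. The class and method names (`TimestampFieldSketch`, `restoreRow`, `effectiveSortByDimsFirst`) and the plain `List<Object>` row layout are assumptions made for illustration only; they are not Druid APIs and not the code in this pull request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these names are not part of Druid.
final class TimestampFieldSketch {

  // Assumed inner row layout after grouping with granularity=day:
  //   [__time, remaining dimensions..., aggregators...]
  // The SQL layer expects the timestamp at the position of the dimension
  // that was removed (timestampResultFieldIndex), e.g. [d0, __time, aggs...].
  static List<Object> restoreRow(List<Object> innerRow, int timestampResultFieldIndex) {
    // Copy everything except the leading timestamp.
    List<Object> restored = new ArrayList<>(innerRow.subList(1, innerRow.size()));
    // Re-insert the timestamp where the time-floor dimension used to be.
    restored.add(timestampResultFieldIndex, innerRow.get(0));
    return restored;
  }

  // Mirrors the two conditions in the diff: both checks read the original
  // sortByDimsFirst value, so at most one of them can apply.
  static boolean effectiveSortByDimsFirst(boolean sortByDimsFirst, int timestampIndex, int dimensionCount) {
    if (!sortByDimsFirst && timestampIndex == dimensionCount - 1) {
      return true;  // timestamp is the last dimension: sort by the other dimensions first
    }
    if (sortByDimsFirst && timestampIndex == 0) {
      return false; // timestamp is the first dimension: same as sorting by time first
    }
    return sortByDimsFirst; // otherwise unchanged; the diff leaves reordering to DefaultLimitSpec
  }
}
```

   Under these assumptions, calling `TimestampFieldSketch.restoreRow` on an inner row of `[timestamp, "beijing", 42L]` with index 1 gives `["beijing", timestamp, 42L]`, which is the `[d0, __time]` shape mentioned in the diff comment.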



