Re: [PR] MSQ: Write summary row for GROUP BY (). (druid)

via GitHub Fri, 08 Nov 2024 08:51:04 -0800


Akshat-Jain commented on code in PR #16326:
URL: https://github.com/apache/druid/pull/16326#discussion_r1833647434



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/groupby/GroupByQueryKit.java:
##########
@@ -148,7 +148,7 @@ public QueryDefinition makeQueryDefinition(
 
       partitionBoost = true;
     } else {
-      shuffleSpecFactoryPreAggregation = doLimitOrOffset
+      shuffleSpecFactoryPreAggregation = doLimitOrOffset || 
intermediateClusterBy.isEmpty()

Review Comment:
   Was this a bug that surfaced only when we had only summary rows?



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/groupby/GroupByPostShuffleFrameProcessor.java:
##########
@@ -263,6 +267,41 @@ private void writeCurrentFrameIfNeeded() throws IOException
       outputChannel.write(new FrameWithPartition(frame, 
FrameWithPartition.NO_PARTITION));
       frameWriter.close();
       frameWriter = null;
+      outputRows += frame.numRows();
+    }
+  }
+
+  /**
+   * Generate a frame containing a single row with default values of all 
aggregations if needed. This method uses
+   * {@link GroupingEngine#summaryRowPreconditions(GroupByQuery)} to determine 
if such an operation is needed.
+   *
+   * Note that in cases where {@link 
GroupingEngine#summaryRowPreconditions(GroupByQuery)} returns true, the
+   * preceding {@link GroupByPreShuffleFrameProcessorFactory} stage would use 
an empty {@link ClusterBy}. Therefore,
+   * there would only be a single output partition of the prior stage, and 
therefore a single instance of
+   * this processor. This ensures that only a single null-aggregations row is 
generated for the entire stage.
+   */
+  private void writeEmptyAggregationsFrameIfNeeded() throws IOException
+  {
+    // Check isIncludeNullResultRow, which is populated by the SQL planner 
when doing a query like GROUP BY ().

Review Comment:
   Which part of this logic corresponds to `isIncludeNullResultRow` condition? 
Trying to understand if it maps to something, or if this is an outdated comment 
by any chance.



##########
processing/src/main/java/org/apache/druid/query/groupby/GroupingEngine.java:
##########
@@ -985,7 +988,7 @@ private static boolean summaryRowPreconditions(GroupByQuery 
query)
         return false;
       }
     }
-    if (!query.getDimensions().isEmpty()) {
+    if (!query.getDimensions().isEmpty() || query.hasDroppedDimensions()) {

Review Comment:
   I'm trying to understand why this change is required. Can you maybe suggest 
some example query which gives unexpected output if we don't have this change?
   
   I tried a few queries with/without this change but unable to find a query 
that shows the difference.
   
   I do see that 
`CalciteQueryTest#testGroupByWithFilterMatchingNothingWithGroupByLiteral` fails 
without this change:
   ```
   2024-11-08T04:50:34,690 INFO [main] 
org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Expected results --
   ----
   2024-11-08T04:50:34,690 INFO [main] 
org.apache.druid.sql.calcite.BaseCalciteQueryTest - -- Actual results --
   new Object[]{0L, null}
   ----
   ```
   
   But I'm unable to get the same behavior when I try running a similar query 
locally: 
   ```sql
   SELECT COUNT(*) FROM wikipedia WHERE cityName = 'foobar' GROUP BY 'dummy'
   ```
   



##########
processing/src/main/java/org/apache/druid/query/timeseries/TimeseriesQueryQueryToolChest.java:
##########
@@ -231,32 +232,17 @@ public Comparator<Result<TimeseriesResultValue>> 
createResultComparator(Query<Re
     return ResultGranularTimestampComparator.create(query.getGranularity(), 
((TimeseriesQuery) query).isDescending());
   }
 
-  private Result<TimeseriesResultValue> 
getNullTimeseriesResultValue(TimeseriesQuery query)
+  /**
+   * Returns a {@link TimeseriesResultValue} that corresponds to an empty-set 
aggregation, which is used in situations
+   * where we want to return a single result representing "nothing was 
aggregated".
+   */
+  public Result<TimeseriesResultValue> 
getEmptyTimeseriesResultValue(TimeseriesQuery query)

Review Comment:
   Nit: Can be made `protected` instead of `public`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] MSQ: Write summary row for GROUP BY (). (druid)

Reply via email to