Re: [PR] "Partition boost" the group by queries in MSQ for better splits (druid)

via GitHub Tue, 02 Jan 2024 02:27:46 -0800


cryptoe commented on code in PR #15474:
URL: https://github.com/apache/druid/pull/15474#discussion_r1439309345



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/groupby/GroupByPostShuffleFrameProcessor.java:
##########
@@ -233,13 +242,19 @@ private boolean writeOutputRow() throws IOException
     finalizeFn.accept(outputRow);
 
     if (frameWriter.addSelection()) {
+      if (partitionBoostVirtualColumn != null) {
+        
partitionBoostVirtualColumn.setValue(partitionBoostVirtualColumn.getValue() + 
1);
+      }
       outputRow = null;
       return false;
     } else if (frameWriter.getNumRows() > 0) {
       writeCurrentFrameIfNeeded();
       setUpFrameWriterIfNeeded();
 
       if (frameWriter.addSelection()) {
+        if (partitionBoostVirtualColumn != null) {
+          
partitionBoostVirtualColumn.setValue(partitionBoostVirtualColumn.getValue() + 
1);

Review Comment:
   Lets collapse this method into incrementPartitionVirtualColumnIfNeeded()



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/groupby/GroupByQueryKit.java:
##########
@@ -99,46 +99,45 @@ public QueryDefinition makeQueryDefinition(
         QueryKitUtils.getSegmentGranularityFromContext(jsonMapper, 
queryToRun.getContext());
     final RowSignature intermediateSignature = 
computeIntermediateSignature(queryToRun);
     final ClusterBy resultClusterByWithoutGranularity = 
computeClusterByForResults(queryToRun);
-    final ClusterBy resultClusterBy =
+    final ClusterBy resultClusterByWithoutPartitionBoost =
         
QueryKitUtils.clusterByWithSegmentGranularity(resultClusterByWithoutGranularity,
 segmentGranularity);
-    final RowSignature resultSignature =
-        QueryKitUtils.sortableSignature(
-            
QueryKitUtils.signatureWithSegmentGranularity(computeResultSignature(queryToRun),
 segmentGranularity),
-            resultClusterBy.getColumns()
-        );
     final ClusterBy intermediateClusterBy = 
computeIntermediateClusterBy(queryToRun);
-    final boolean doOrderBy = !resultClusterBy.equals(intermediateClusterBy);
+    final boolean doOrderBy = 
!resultClusterByWithoutPartitionBoost.equals(intermediateClusterBy);
     final boolean doLimitOrOffset =
         queryToRun.getLimitSpec() instanceof DefaultLimitSpec
         && (((DefaultLimitSpec) queryToRun.getLimitSpec()).isLimited()
             || ((DefaultLimitSpec) queryToRun.getLimitSpec()).isOffset());
 
     final ShuffleSpecFactory shuffleSpecFactoryPreAggregation;
     final ShuffleSpecFactory shuffleSpecFactoryPostAggregation;
+    boolean partitionBoosted;
 
-    // There can be a situation where intermediateClusterBy is empty, while 
the result is non-empty
-    // if we have PARTITIONED BY on anything except ALL, however we don't have 
a grouping dimension
-    // (i.e. no GROUP BY clause)
-    // __time in such queries is generated using either an aggregator (e.g. 
sum(metric) as __time) or using a
-    // post-aggregator (e.g. TIMESTAMP '2000-01-01' as __time)
-    if (intermediateClusterBy.isEmpty() && resultClusterBy.isEmpty()) {
+    if (intermediateClusterBy.isEmpty() && 
resultClusterByWithoutPartitionBoost.isEmpty()) {

Review Comment:
   Lets add an documentation here for this. 
   `Insert into foo select count(*) from bar partitioned by ALL.` 



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/groupby/GroupByPostShuffleFrameProcessor.java:
##########
@@ -233,13 +242,19 @@ private boolean writeOutputRow() throws IOException
     finalizeFn.accept(outputRow);
 
     if (frameWriter.addSelection()) {
+      if (partitionBoostVirtualColumn != null) {
+        
partitionBoostVirtualColumn.setValue(partitionBoostVirtualColumn.getValue() + 
1);

Review Comment:
   Will this cause the compare function on L161 to behave weirdly ?



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/groupby/GroupByQueryKit.java:
##########
@@ -151,25 +150,45 @@ public QueryDefinition makeQueryDefinition(
                        .processorFactory(new 
GroupByPreShuffleFrameProcessorFactory(queryToRun))
     );
 
+    List<KeyColumn> resultClusterByWithPartitionBoostColumns = new 
ArrayList<>(resultClusterByWithoutPartitionBoost.getColumns());
+    resultClusterByWithPartitionBoostColumns.add(new 
KeyColumn(QueryKitUtils.PARTITION_BOOST_COLUMN, KeyOrder.ASCENDING));
+    ClusterBy resultClusterByWithPartitionBoost = new ClusterBy(
+        resultClusterByWithPartitionBoostColumns,
+        resultClusterByWithoutPartitionBoost.getBucketByCount()

Review Comment:
   I think line 153 : 167 can be another method called createRowSignature. 
   Also the if on line 159 should be the first switch in the control flow. 
Something like 
   ```
   SigX=null; 
   if(boosted)
   {
   sigX=foo
   }else{
   sigX=bar
   }
   ```



##########
extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/querykit/groupby/GroupByQueryKit.java:
##########
@@ -99,46 +99,45 @@ public QueryDefinition makeQueryDefinition(
         QueryKitUtils.getSegmentGranularityFromContext(jsonMapper, 
queryToRun.getContext());
     final RowSignature intermediateSignature = 
computeIntermediateSignature(queryToRun);
     final ClusterBy resultClusterByWithoutGranularity = 
computeClusterByForResults(queryToRun);
-    final ClusterBy resultClusterBy =
+    final ClusterBy resultClusterByWithoutPartitionBoost =
         
QueryKitUtils.clusterByWithSegmentGranularity(resultClusterByWithoutGranularity,
 segmentGranularity);
-    final RowSignature resultSignature =
-        QueryKitUtils.sortableSignature(
-            
QueryKitUtils.signatureWithSegmentGranularity(computeResultSignature(queryToRun),
 segmentGranularity),
-            resultClusterBy.getColumns()
-        );
     final ClusterBy intermediateClusterBy = 
computeIntermediateClusterBy(queryToRun);
-    final boolean doOrderBy = !resultClusterBy.equals(intermediateClusterBy);
+    final boolean doOrderBy = 
!resultClusterByWithoutPartitionBoost.equals(intermediateClusterBy);
     final boolean doLimitOrOffset =
         queryToRun.getLimitSpec() instanceof DefaultLimitSpec
         && (((DefaultLimitSpec) queryToRun.getLimitSpec()).isLimited()
             || ((DefaultLimitSpec) queryToRun.getLimitSpec()).isOffset());
 
     final ShuffleSpecFactory shuffleSpecFactoryPreAggregation;
     final ShuffleSpecFactory shuffleSpecFactoryPostAggregation;
+    boolean partitionBoosted;
 
-    // There can be a situation where intermediateClusterBy is empty, while 
the result is non-empty
-    // if we have PARTITIONED BY on anything except ALL, however we don't have 
a grouping dimension
-    // (i.e. no GROUP BY clause)
-    // __time in such queries is generated using either an aggregator (e.g. 
sum(metric) as __time) or using a
-    // post-aggregator (e.g. TIMESTAMP '2000-01-01' as __time)
-    if (intermediateClusterBy.isEmpty() && resultClusterBy.isEmpty()) {
+    if (intermediateClusterBy.isEmpty() && 
resultClusterByWithoutPartitionBoost.isEmpty()) {
       // Ignore shuffleSpecFactory, since we know only a single partition will 
come out, and we can save some effort.
       shuffleSpecFactoryPreAggregation = 
ShuffleSpecFactories.singlePartition();
       shuffleSpecFactoryPostAggregation = 
ShuffleSpecFactories.singlePartition();
+      partitionBoosted = false;
     } else if (doOrderBy) {
+      // There can be a situation where intermediateClusterBy is empty, while 
the result is non-empty

Review Comment:
   ```suggestion
         // There can be a situation where intermediateClusterBy is empty, 
while the resultClusterBy is non-empty
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] "Partition boost" the group by queries in MSQ for better splits (druid)

Reply via email to