Re: [PR] Aggregations Support `Partitioning::Range` [datafusion]

via GitHub Wed, 01 Jul 2026 12:18:33 -0700


gene-bordegaray commented on code in PR #23239:
URL: https://github.com/apache/datafusion/pull/23239#discussion_r3508497075



##########
datafusion/sqllogictest/test_files/range_partitioning.slt:
##########
@@ -79,7 +87,182 @@ SELECT non_range_key, SUM(value) FROM range_partitioned GROUP BY non_range_key O
 
 
 ##########
-# TEST 3: Join on Range Partition Column
+# TEST 3: Aggregate Reuses Range Subset Partitioning
+# With subset threshold met and preserve-file disabled, Range([range_key])
+# satisfies grouping by (range_key, non_range_key).
+##########
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 4;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, non_range_key, SUM(value) FROM range_partitioned GROUP BY range_key, non_range_key;
+----
+physical_plan
+01)AggregateExec: mode=SinglePartitioned, gby=[range_key@0 as range_key, non_range_key@1 as non_range_key], aggr=[sum(range_partitioned.value)]
+02)--DataSourceExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-0.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-1.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-2.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-3.csv]]}, projection=[range_key, non_range_key, value], output_partitioning=Range([range_key@0 ASC], [(10), (20), (30)], 4), file_type=csv, has_header=false
+
+query III
+SELECT range_key, non_range_key, SUM(value) FROM range_partitioned GROUP BY range_key, non_range_key ORDER BY range_key, non_range_key;
+----
+1 1 10
+5 2 50
+10 1 100
+15 2 150
+20 1 200
+25 2 250
+30 1 300
+35 2 350
+
+
+##########
+# TEST 4: Exact Range Aggregate Below Subset Threshold
+# Even when subset satisfaction is disabled, exact Range([range_key])
+# satisfies GROUP BY range_key when repartitioning would not increase
+# partition count.
+##########
+
+statement ok
+set datafusion.execution.target_partitions = 4;
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 5;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, SUM(value) FROM range_partitioned GROUP BY range_key;

Review Comment:
   I didn't run these because the physical plan is the same as in Test 1. Is the
reason for duplicating them to check whether changes in the physical plan result
in a query result mismatch?



##########
datafusion/sqllogictest/test_files/range_partitioning.slt:
##########
@@ -79,7 +87,182 @@ SELECT non_range_key, SUM(value) FROM range_partitioned GROUP BY non_range_key O
 
 
 ##########
-# TEST 3: Join on Range Partition Column
+# TEST 3: Aggregate Reuses Range Subset Partitioning
+# With subset threshold met and preserve-file disabled, Range([range_key])
+# satisfies grouping by (range_key, non_range_key).
+##########
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 4;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, non_range_key, SUM(value) FROM range_partitioned GROUP BY range_key, non_range_key;
+----
+physical_plan
+01)AggregateExec: mode=SinglePartitioned, gby=[range_key@0 as range_key, non_range_key@1 as non_range_key], aggr=[sum(range_partitioned.value)]
+02)--DataSourceExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-0.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-1.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-2.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-3.csv]]}, projection=[range_key, non_range_key, value], output_partitioning=Range([range_key@0 ASC], [(10), (20), (30)], 4), file_type=csv, has_header=false
+
+query III
+SELECT range_key, non_range_key, SUM(value) FROM range_partitioned GROUP BY range_key, non_range_key ORDER BY range_key, non_range_key;
+----
+1 1 10
+5 2 50
+10 1 100
+15 2 150
+20 1 200
+25 2 250
+30 1 300
+35 2 350
+
+
+##########
+# TEST 4: Exact Range Aggregate Below Subset Threshold
+# Even when subset satisfaction is disabled, exact Range([range_key])
+# satisfies GROUP BY range_key when repartitioning would not increase
+# partition count.
+##########
+
+statement ok
+set datafusion.execution.target_partitions = 4;
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 5;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, SUM(value) FROM range_partitioned GROUP BY range_key;
+----
+physical_plan
+01)AggregateExec: mode=SinglePartitioned, gby=[range_key@0 as range_key], aggr=[sum(range_partitioned.value)]
+02)--DataSourceExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-0.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-1.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-2.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-3.csv]]}, projection=[range_key, value], output_partitioning=Range([range_key@0 ASC], [(10), (20), (30)], 4), file_type=csv, has_header=false
+
+
+##########
+# TEST 5: Range Subset Aggregate Rehashes Below Subset Threshold
+# Range([range_key]) is only a subset of GROUP BY (range_key, non_range_key),
+# so it should not satisfy the aggregate key when subset satisfaction is
+# disabled.
+##########
+
+statement ok
+set datafusion.execution.target_partitions = 4;
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 5;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, non_range_key, SUM(value) FROM range_partitioned GROUP BY range_key, non_range_key;

Review Comment:
   Ditto: same question as on the previous test.
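
For reference, the behavior that Tests 3 to 5 describe in their header comments can be modeled roughly as follows. This is a toy Python sketch built only from those comments, not DataFusion's optimizer logic. The function name `range_partitioning_reused` and its parameters are hypothetical, and treating a threshold of 0 as "subset satisfaction disabled" is an assumption.

```python
from typing import Sequence


def range_partitioning_reused(
    partition_keys: Sequence[str],
    group_keys: Sequence[str],
    partition_count: int,
    target_partitions: int,
    subset_threshold: int,
) -> bool:
    """Toy model: can an aggregate reuse the existing Range partitioning?

    Hypothetical helper derived from the test header comments only.
    Returns True when no repartition would be needed before aggregating.
    """
    part = set(partition_keys)
    group = set(group_keys)
    if not part:
        return False

    # Exact match (Test 4): reuse as long as repartitioning would not
    # increase the partition count, regardless of the subset threshold.
    if part == group:
        return partition_count >= target_partitions

    # Subset match (Tests 3 and 5): the partition keys are a strict subset
    # of the grouping keys. Reuse only when the partition count has reached
    # the subset threshold (assumption: 0 disables the optimization).
    subset_enabled = subset_threshold > 0 and partition_count >= subset_threshold
    return part < group and subset_enabled


# The three configurations from the tests, with 4 range partitions.
test3 = range_partitioning_reused(
    ["range_key"], ["range_key", "non_range_key"], 4, 4, subset_threshold=4
)
test4 = range_partitioning_reused(
    ["range_key"], ["range_key"], 4, 4, subset_threshold=5
)
test5 = range_partitioning_reused(
    ["range_key"], ["range_key", "non_range_key"], 4, 4, subset_threshold=5
)
print(test3, test4, test5)
```

Under this model, Test 3 and Test 4 keep the `SinglePartitioned` aggregate directly on the scan, while Test 5 would need a repartition because the subset threshold is not met.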



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

