gene-bordegaray commented on code in PR #23239:
URL: https://github.com/apache/datafusion/pull/23239#discussion_r3508497075
##########
datafusion/sqllogictest/test_files/range_partitioning.slt:
##########
@@ -79,7 +87,182 @@ SELECT non_range_key, SUM(value) FROM range_partitioned
GROUP BY non_range_key O
##########
-# TEST 3: Join on Range Partition Column
+# TEST 3: Aggregate Reuses Range Subset Partitioning
+# With subset threshold met and preserve-file disabled, Range([range_key])
+# satisfies grouping by (range_key, non_range_key).
+##########
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 4;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, non_range_key, SUM(value) FROM range_partitioned
GROUP BY range_key, non_range_key;
+----
+physical_plan
+01)AggregateExec: mode=SinglePartitioned, gby=[range_key@0 as range_key,
non_range_key@1 as non_range_key], aggr=[sum(range_partitioned.value)]
+02)--DataSourceExec: file_groups={4 groups:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-0.csv],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-1.csv],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-2.csv],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-3.csv]]},
projection=[range_key, non_range_key, value],
output_partitioning=Range([range_key@0 ASC], [(10), (20), (30)], 4),
file_type=csv, has_header=false
+
+query III
+SELECT range_key, non_range_key, SUM(value) FROM range_partitioned GROUP BY
range_key, non_range_key ORDER BY range_key, non_range_key;
+----
+1 1 10
+5 2 50
+10 1 100
+15 2 150
+20 1 200
+25 2 250
+30 1 300
+35 2 350
+
+
+##########
+# TEST 4: Exact Range Aggregate Below Subset Threshold
+# Even when subset satisfaction is disabled, exact Range([range_key])
+# satisfies GROUP BY range_key when repartitioning would not increase
+# partition count.
+##########
+
+statement ok
+set datafusion.execution.target_partitions = 4;
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 5;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, SUM(value) FROM range_partitioned GROUP BY range_key;
Review Comment:
I didn't run these because the physical plan is the same as in test 1. Is the
reason for the duplication to check whether changes in the physical plan cause
a query result mismatch?
##########
datafusion/sqllogictest/test_files/range_partitioning.slt:
##########
@@ -79,7 +87,182 @@ SELECT non_range_key, SUM(value) FROM range_partitioned
GROUP BY non_range_key O
##########
-# TEST 3: Join on Range Partition Column
+# TEST 3: Aggregate Reuses Range Subset Partitioning
+# With subset threshold met and preserve-file disabled, Range([range_key])
+# satisfies grouping by (range_key, non_range_key).
+##########
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 4;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, non_range_key, SUM(value) FROM range_partitioned
GROUP BY range_key, non_range_key;
+----
+physical_plan
+01)AggregateExec: mode=SinglePartitioned, gby=[range_key@0 as range_key,
non_range_key@1 as non_range_key], aggr=[sum(range_partitioned.value)]
+02)--DataSourceExec: file_groups={4 groups:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-0.csv],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-1.csv],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-2.csv],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-3.csv]]},
projection=[range_key, non_range_key, value],
output_partitioning=Range([range_key@0 ASC], [(10), (20), (30)], 4),
file_type=csv, has_header=false
+
+query III
+SELECT range_key, non_range_key, SUM(value) FROM range_partitioned GROUP BY
range_key, non_range_key ORDER BY range_key, non_range_key;
+----
+1 1 10
+5 2 50
+10 1 100
+15 2 150
+20 1 200
+25 2 250
+30 1 300
+35 2 350
+
+
+##########
+# TEST 4: Exact Range Aggregate Below Subset Threshold
+# Even when subset satisfaction is disabled, exact Range([range_key])
+# satisfies GROUP BY range_key when repartitioning would not increase
+# partition count.
+##########
+
+statement ok
+set datafusion.execution.target_partitions = 4;
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 5;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, SUM(value) FROM range_partitioned GROUP BY range_key;
+----
+physical_plan
+01)AggregateExec: mode=SinglePartitioned, gby=[range_key@0 as range_key],
aggr=[sum(range_partitioned.value)]
+02)--DataSourceExec: file_groups={4 groups:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-0.csv],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-1.csv],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-2.csv],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-3.csv]]},
projection=[range_key, value], output_partitioning=Range([range_key@0 ASC],
[(10), (20), (30)], 4), file_type=csv, has_header=false
+
+
+##########
+# TEST 5: Range Subset Aggregate Rehashes Below Subset Threshold
+# Range([range_key]) is only a subset of GROUP BY (range_key, non_range_key),
+# so it should not satisfy the aggregate key when subset satisfaction is
+# disabled.
+##########
+
+statement ok
+set datafusion.execution.target_partitions = 4;
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 5;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, non_range_key, SUM(value) FROM range_partitioned
GROUP BY range_key, non_range_key;
Review Comment:
ditto
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]