gene-bordegaray commented on code in PR #23239:
URL: https://github.com/apache/datafusion/pull/23239#discussion_r3508497075
##########
datafusion/sqllogictest/test_files/range_partitioning.slt:
##########
@@ -79,7 +87,182 @@ SELECT non_range_key, SUM(value) FROM range_partitioned
GROUP BY non_range_key O
##########
-# TEST 3: Join on Range Partition Column
+# TEST 3: Aggregate Reuses Range Subset Partitioning
+# With subset threshold met and preserve-file disabled, Range([range_key])
+# satisfies grouping by (range_key, non_range_key).
+##########
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 4;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, non_range_key, SUM(value) FROM range_partitioned GROUP BY range_key, non_range_key;
+----
+physical_plan
+01)AggregateExec: mode=SinglePartitioned, gby=[range_key@0 as range_key, non_range_key@1 as non_range_key], aggr=[sum(range_partitioned.value)]
+02)--DataSourceExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-0.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-1.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-2.csv], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch_range_partitioning/range_partitioned/part-3.csv]]}, projection=[range_key, non_range_key, value], output_partitioning=Range([range_key@0 ASC], [(10), (20), (30)], 4), file_type=csv, has_header=false
+
+query III
+SELECT range_key, non_range_key, SUM(value) FROM range_partitioned GROUP BY range_key, non_range_key ORDER BY range_key, non_range_key;
+----
+1 1 10
+5 2 50
+10 1 100
+15 2 150
+20 1 200
+25 2 250
+30 1 300
+35 2 350
+
+
+##########
+# TEST 4: Exact Range Aggregate Below Subset Threshold
+# Even when subset satisfaction is disabled, exact Range([range_key])
+# satisfies GROUP BY range_key when repartitioning would not increase
+# partition count.
+##########
+
+statement ok
+set datafusion.execution.target_partitions = 4;
+
+statement ok
+set datafusion.optimizer.subset_repartition_threshold = 5;
+
+statement ok
+set datafusion.optimizer.preserve_file_partitions = 0;
+
+query TT
+EXPLAIN SELECT range_key, SUM(value) FROM range_partitioned GROUP BY range_key;
Review Comment:
I didn't run these because the physical plan is the same as in test 1, and I
didn't want the churn.
Out of curiosity, is the reason for duplicating them to check whether changes
in the physical plan result in a query result mismatch?
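
For context on why TEST 3 can skip repartitioning, here is a minimal Python sketch (not DataFusion code) of the property the test relies on: rows that agree on (range_key, non_range_key) also agree on range_key, so Range([range_key]) already places each group in a single partition. The split points come from the plan above. The rows, the helper names, and the rule that a key equal to a split point goes to the upper partition are assumptions made for illustration only.

```python
from bisect import bisect_right
from collections import defaultdict

# Split points taken from output_partitioning=Range(..., [(10), (20), (30)], 4).
SPLIT_POINTS = [10, 20, 30]

# Illustrative (range_key, non_range_key, value) rows; assumed, not the real CSV contents.
ROWS = [
    (1, 1, 10), (5, 2, 50), (10, 1, 100), (15, 2, 150),
    (20, 1, 200), (25, 2, 250), (30, 1, 300), (35, 2, 350),
]


def partition_of(range_key):
    # Assumption: a key equal to a split point belongs to the upper partition.
    return bisect_right(SPLIT_POINTS, range_key)


def aggregate_per_partition(rows):
    # Sum values per (range_key, non_range_key) independently inside each partition.
    partitions = defaultdict(lambda: defaultdict(int))
    for range_key, non_range_key, value in rows:
        partitions[partition_of(range_key)][(range_key, non_range_key)] += value
    return partitions


def groups_are_disjoint(partitions):
    # True when no group key shows up in more than one partition.
    seen = set()
    for groups in partitions.values():
        for key in groups:
            if key in seen:
                return False
            seen.add(key)
    return True


partitions = aggregate_per_partition(ROWS)
# Concatenating per-partition results gives the final answer without a merge step.
merged = sorted(
    (rk, nrk, total)
    for groups in partitions.values()
    for (rk, nrk), total in groups.items()
)
```

Because every group stays inside one partition, the per-partition sums are already final, which is consistent with the plan showing a single AggregateExec directly over the scan.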