Re: [PR] Do not add redundant subquery ordering into plan [datafusion]

via GitHub Fri, 16 Aug 2024 10:47:27 -0700


alamb commented on code in PR #12003:
URL: https://github.com/apache/datafusion/pull/12003#discussion_r1720135221



##########
datafusion/sqllogictest/test_files/joins.slt:
##########
@@ -3898,6 +3898,7 @@ SELECT * FROM (
 ) as lhs RIGHT JOIN (
     SELECT * from right_table_no_nulls
     ORDER BY b
+    LIMIT 10

Review Comment:
   Rather than modifying the existing test, can you please add a new one ? If 
we modify the existing test we may also inadvertently be changing the coverage 
of this test
   
   Likewise for the other tests in this file that now have limits



##########
datafusion/sqllogictest/test_files/select.slt:
##########
@@ -990,13 +990,13 @@ FROM (
     ) AS a
 ) AS b
 ----
-a 5 -101
-a 5 -54
 a 5 -38
+a 5 -54

Review Comment:
   this chnage looks like it is because the test itself is not deterministic



##########
datafusion/core/src/dataframe/mod.rs:
##########
@@ -2923,13 +2923,12 @@ mod tests {
         assert_eq!(
             "\
         Projection: t1.c1, t2.c1, Boolean(true) AS new_column\
-        \n  Limit: skip=0, fetch=1\
-        \n    Sort: t1.c1 ASC NULLS FIRST, fetch=1\
-        \n      Inner Join: t1.c1 = t2.c1\
-        \n        SubqueryAlias: t1\
-        \n          TableScan: aggregate_test_100 projection=[c1]\
-        \n        SubqueryAlias: t2\
-        \n          TableScan: aggregate_test_100 projection=[c1]",
+        \n  Sort: t1.c1 ASC NULLS FIRST, fetch=1\

Review Comment:
   This plan looks better to me because since the sort already has a fetch 
there is no reason to also apply a limit afterwards 



##########
datafusion/sqllogictest/test_files/aggregates_topk.slt:
##########
@@ -95,77 +93,69 @@ query TT
 explain select trace_id, MAX(timestamp) from traces group by trace_id order by 
MAX(timestamp) desc limit 4;
 ----
 logical_plan
-01)Limit: skip=0, fetch=4
-02)--Sort: max(traces.timestamp) DESC NULLS FIRST, fetch=4
-03)----Aggregate: groupBy=[[traces.trace_id]], aggr=[[max(traces.timestamp)]]
-04)------TableScan: traces projection=[trace_id, timestamp]
+01)Sort: max(traces.timestamp) DESC NULLS FIRST, fetch=4
+02)--Aggregate: groupBy=[[traces.trace_id]], aggr=[[max(traces.timestamp)]]
+03)----TableScan: traces projection=[trace_id, timestamp]
 physical_plan
-01)GlobalLimitExec: skip=0, fetch=4
-02)--SortPreservingMergeExec: [max(traces.timestamp)@1 DESC], fetch=4
-03)----SortExec: TopK(fetch=4), expr=[max(traces.timestamp)@1 DESC], 
preserve_partitioning=[true]
-04)------AggregateExec: mode=FinalPartitioned, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)], lim=[4]
-05)--------CoalesceBatchesExec: target_batch_size=8192
-06)----------RepartitionExec: partitioning=Hash([trace_id@0], 4), 
input_partitions=4
-07)------------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
-08)--------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)], lim=[4]
-09)----------------MemoryExec: partitions=1, partition_sizes=[1]
+01)SortPreservingMergeExec: [max(traces.timestamp)@1 DESC], fetch=4
+02)--SortExec: TopK(fetch=4), expr=[max(traces.timestamp)@1 DESC], 
preserve_partitioning=[true]
+03)----AggregateExec: mode=FinalPartitioned, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)], lim=[4]
+04)------CoalesceBatchesExec: target_batch_size=8192
+05)--------RepartitionExec: partitioning=Hash([trace_id@0], 4), 
input_partitions=4
+06)----------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
+07)------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)], lim=[4]
+08)--------------MemoryExec: partitions=1, partition_sizes=[1]
 
 query TT
 explain select trace_id, MIN(timestamp) from traces group by trace_id order by 
MIN(timestamp) desc limit 4;
 ----
 logical_plan
-01)Limit: skip=0, fetch=4
-02)--Sort: min(traces.timestamp) DESC NULLS FIRST, fetch=4
-03)----Aggregate: groupBy=[[traces.trace_id]], aggr=[[min(traces.timestamp)]]
-04)------TableScan: traces projection=[trace_id, timestamp]
+01)Sort: min(traces.timestamp) DESC NULLS FIRST, fetch=4
+02)--Aggregate: groupBy=[[traces.trace_id]], aggr=[[min(traces.timestamp)]]
+03)----TableScan: traces projection=[trace_id, timestamp]
 physical_plan
-01)GlobalLimitExec: skip=0, fetch=4
-02)--SortPreservingMergeExec: [min(traces.timestamp)@1 DESC], fetch=4
-03)----SortExec: TopK(fetch=4), expr=[min(traces.timestamp)@1 DESC], 
preserve_partitioning=[true]
-04)------AggregateExec: mode=FinalPartitioned, gby=[trace_id@0 as trace_id], 
aggr=[min(traces.timestamp)]
-05)--------CoalesceBatchesExec: target_batch_size=8192
-06)----------RepartitionExec: partitioning=Hash([trace_id@0], 4), 
input_partitions=4
-07)------------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
-08)--------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], 
aggr=[min(traces.timestamp)]
-09)----------------MemoryExec: partitions=1, partition_sizes=[1]
+01)SortPreservingMergeExec: [min(traces.timestamp)@1 DESC], fetch=4
+02)--SortExec: TopK(fetch=4), expr=[min(traces.timestamp)@1 DESC], 
preserve_partitioning=[true]
+03)----AggregateExec: mode=FinalPartitioned, gby=[trace_id@0 as trace_id], 
aggr=[min(traces.timestamp)]
+04)------CoalesceBatchesExec: target_batch_size=8192
+05)--------RepartitionExec: partitioning=Hash([trace_id@0], 4), 
input_partitions=4
+06)----------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
+07)------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], 
aggr=[min(traces.timestamp)]
+08)--------------MemoryExec: partitions=1, partition_sizes=[1]
 
 query TT
 explain select trace_id, MAX(timestamp) from traces group by trace_id order by 
MAX(timestamp) asc limit 4;
 ----
 logical_plan
-01)Limit: skip=0, fetch=4
-02)--Sort: max(traces.timestamp) ASC NULLS LAST, fetch=4
-03)----Aggregate: groupBy=[[traces.trace_id]], aggr=[[max(traces.timestamp)]]
-04)------TableScan: traces projection=[trace_id, timestamp]
+01)Sort: max(traces.timestamp) ASC NULLS LAST, fetch=4
+02)--Aggregate: groupBy=[[traces.trace_id]], aggr=[[max(traces.timestamp)]]
+03)----TableScan: traces projection=[trace_id, timestamp]
 physical_plan
-01)GlobalLimitExec: skip=0, fetch=4
-02)--SortPreservingMergeExec: [max(traces.timestamp)@1 ASC NULLS LAST], fetch=4
-03)----SortExec: TopK(fetch=4), expr=[max(traces.timestamp)@1 ASC NULLS LAST], 
preserve_partitioning=[true]
-04)------AggregateExec: mode=FinalPartitioned, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)]
-05)--------CoalesceBatchesExec: target_batch_size=8192
-06)----------RepartitionExec: partitioning=Hash([trace_id@0], 4), 
input_partitions=4
-07)------------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
-08)--------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)]
-09)----------------MemoryExec: partitions=1, partition_sizes=[1]
+01)SortPreservingMergeExec: [max(traces.timestamp)@1 ASC NULLS LAST], fetch=4
+02)--SortExec: TopK(fetch=4), expr=[max(traces.timestamp)@1 ASC NULLS LAST], 
preserve_partitioning=[true]
+03)----AggregateExec: mode=FinalPartitioned, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)]
+04)------CoalesceBatchesExec: target_batch_size=8192
+05)--------RepartitionExec: partitioning=Hash([trace_id@0], 4), 
input_partitions=4
+06)----------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
+07)------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)]
+08)--------------MemoryExec: partitions=1, partition_sizes=[1]
 
 query TT
 explain select trace_id, MAX(timestamp) from traces group by trace_id order by 
trace_id asc limit 4;
 ----
 logical_plan
-01)Limit: skip=0, fetch=4
-02)--Sort: traces.trace_id ASC NULLS LAST, fetch=4
-03)----Aggregate: groupBy=[[traces.trace_id]], aggr=[[max(traces.timestamp)]]
-04)------TableScan: traces projection=[trace_id, timestamp]
+01)Sort: traces.trace_id ASC NULLS LAST, fetch=4
+02)--Aggregate: groupBy=[[traces.trace_id]], aggr=[[max(traces.timestamp)]]
+03)----TableScan: traces projection=[trace_id, timestamp]
 physical_plan
-01)GlobalLimitExec: skip=0, fetch=4
-02)--SortPreservingMergeExec: [trace_id@0 ASC NULLS LAST], fetch=4
-03)----SortExec: TopK(fetch=4), expr=[trace_id@0 ASC NULLS LAST], 
preserve_partitioning=[true]
-04)------AggregateExec: mode=FinalPartitioned, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)]
-05)--------CoalesceBatchesExec: target_batch_size=8192
-06)----------RepartitionExec: partitioning=Hash([trace_id@0], 4), 
input_partitions=4
-07)------------RepartitionExec: partitioning=RoundRobinBatch(4), 
input_partitions=1
-08)--------------AggregateExec: mode=Partial, gby=[trace_id@0 as trace_id], 
aggr=[max(traces.timestamp)]
-09)----------------MemoryExec: partitions=1, partition_sizes=[1]
+01)SortPreservingMergeExec: [trace_id@0 ASC NULLS LAST], fetch=4

Review Comment:
   it sure is getting easier to understand what the plans are doing here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Do not add redundant subquery ordering into plan [datafusion]

Reply via email to