[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #7566: Enhance/Refactor Ordering Equivalence Properties

via GitHub Mon, 18 Sep 2023 00:44:33 -0700


mustafasrepo commented on code in PR #7566:
URL: https://github.com/apache/arrow-datafusion/pull/7566#discussion_r1328339665



##########
datafusion/sqllogictest/test_files/subquery.slt:
##########
@@ -284,19 +284,20 @@ Projection: t1.t1_id, __scalar_sq_1.SUM(t2.t2_int) AS 
t2_sum
 ------------TableScan: t2 projection=[t2_id, t2_int]
 physical_plan
 ProjectionExec: expr=[t1_id@0 as t1_id, SUM(t2.t2_int)@1 as t2_sum]
---CoalesceBatchesExec: target_batch_size=8192
-----HashJoinExec: mode=Partitioned, join_type=Left, on=[(t1_id@0, t2_id@1)]
-------CoalesceBatchesExec: target_batch_size=8192
---------RepartitionExec: partitioning=Hash([t1_id@0], 4), input_partitions=4
-----------MemoryExec: partitions=4, partition_sizes=[1, 0, 0, 0]
-------ProjectionExec: expr=[SUM(t2.t2_int)@1 as SUM(t2.t2_int), t2_id@0 as 
t2_id]
+--ProjectionExec: expr=[t1_id@2 as t1_id, SUM(t2.t2_int)@0 as SUM(t2.t2_int), 
t2_id@1 as t2_id]
+----CoalesceBatchesExec: target_batch_size=8192
+------HashJoinExec: mode=Partitioned, join_type=Right, on=[(t2_id@1, t1_id@0)]

Review Comment:
   Since with the changes in this PR, filter doesn't return 
`Statistics::default`. It propagates `num_rows` of `AggregateExec` above. Then 
`join_selection` rule chooses filter side as build (since filter side has less 
rows than the other side). Previously, since number of rows is not propagated, 
`join_selection` rule didn't change sides. Since, we can propagate additional 
information now, planner can choose better build side



##########
datafusion/sqllogictest/test_files/subquery.slt:
##########
@@ -284,19 +284,20 @@ Projection: t1.t1_id, __scalar_sq_1.SUM(t2.t2_int) AS 
t2_sum
 ------------TableScan: t2 projection=[t2_id, t2_int]
 physical_plan
 ProjectionExec: expr=[t1_id@0 as t1_id, SUM(t2.t2_int)@1 as t2_sum]
---CoalesceBatchesExec: target_batch_size=8192
-----HashJoinExec: mode=Partitioned, join_type=Left, on=[(t1_id@0, t2_id@1)]
-------CoalesceBatchesExec: target_batch_size=8192
---------RepartitionExec: partitioning=Hash([t1_id@0], 4), input_partitions=4
-----------MemoryExec: partitions=4, partition_sizes=[1, 0, 0, 0]
-------ProjectionExec: expr=[SUM(t2.t2_int)@1 as SUM(t2.t2_int), t2_id@0 as 
t2_id]
+--ProjectionExec: expr=[t1_id@2 as t1_id, SUM(t2.t2_int)@0 as SUM(t2.t2_int), 
t2_id@1 as t2_id]
+----CoalesceBatchesExec: target_batch_size=8192
+------HashJoinExec: mode=Partitioned, join_type=Right, on=[(t2_id@1, t1_id@0)]

Review Comment:
   Since with the changes in this PR, filter doesn't return 
`Statistics::default`. It propagates `num_rows` of `AggregateExec` above. Then 
`join_selection` rule chooses filter side as build (since filter side has less 
rows than the other side). Previously, since number of rows is not propagated, 
`join_selection` rule didn't change sides. Since, we propagate additional 
information now, planner can choose better build side



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #7566: Enhance/Refactor Ordering Equivalence Properties

Reply via email to