[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #5345: Refactor DecorrelateWhereExists and add back Distinct if needs

via GitHub Thu, 09 Mar 2023 18:16:27 -0800


mingmwang commented on code in PR #5345:
URL: https://github.com/apache/arrow-datafusion/pull/5345#discussion_r1131881136



##########
datafusion/optimizer/tests/integration-test.rs:
##########
@@ -151,8 +151,9 @@ fn where_exists_distinct() -> Result<()> {
     let plan = test_sql(sql)?;
     let expected = "LeftSemi Join: test.col_int32 = t2.col_int32\
     \n  TableScan: test projection=[col_int32]\
-    \n  SubqueryAlias: t2\
-    \n    TableScan: test projection=[col_int32]";
+    \n  Aggregate: groupBy=[[t2.col_int32]], aggr=[[]]\
+    \n    SubqueryAlias: t2\
+    \n      TableScan: test projection=[col_int32]";

Review Comment:
   Most of the time adding `Distinct`/`Aggregate` is better, it can reduce the 
data volumes before the `Join`(reduce the shuffle data volumes before the 
`Join` in a distribute engine).
   Regarding the positions of `Join`/`Aggregation`, this is a another topic, in 
some engines they leverage  CBO rules to decide pull down/pull up aggregations 
through Joins.   
   Snowflake implements a feature called Adaptive Aggregate placement, the 
basic idea is if the `Aggregate` added by the optimizer can not dedup, just 
skip the `Aggregate`. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #5345: Refactor DecorrelateWhereExists and add back Distinct if needs

Reply via email to