allisonwang-db opened a new pull request #33903:
URL: https://github.com/apache/spark/pull/33903
### What changes were proposed in this pull request?
This PR adds an additional check in the `CollapseProject` rule to prevent
inlining expressions with correlated scalar subqueries.
### Why are the changes needed?
To avoid introducing unnecessary left outer joins when rewriting correlated
subqueries. Projects should be collapsed after the scalar subqueries have been
rewritten into left outer joins.
For example
```
// Before CollapseProject
Project [c1, s, (s * 10)]
+- Project [c1, scalar-subquery [c1] AS s]
: +- Aggregate [c1], [first(c2), c1]
: +- LocalRelation [c1, c2]
+- LocalRelation [c1, c2]
// After (scalar subqueries are inlined)
Project [c1, scalar-subquery [c1], (scalar-subquery [c1] * 10)]
: +- Aggregate [c1], [first(c2), c1]
: +- LocalRelation [c1, c2]
: +- Aggregate [c1], [first(c2), c1]
: +- LocalRelation [c1, c2]
+- LocalRelation [c1, c2]
```
Then the rule `RewriteCorrelatedScalarSubquery` rule will create two left
outer joins instead of one. Also, since duplicate join attributes are not
handled in this rule, this rewrite will break the structural integrity of the
plan.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test. Both test cases throw the following error before this PR:
```
After applying rule
org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery in
batch Operator Optimization before Inferring Filters, the structural integrity
of the plan is broken.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]