jchen5 opened a new pull request, #39375:
URL: https://github.com/apache/spark/pull/39375
## What changes were proposed in this pull request?
Adds support for subquery decorrelation with UNION operators on the
correlation paths. For example:
```
SELECT t1a, (
SELECT avg(b) FROM (
SELECT t2b as b FROM t2 WHERE t2a = t1a
UNION
SELECT t3b as b FROM t3 WHERE t3a = t1a
))
FROM t1
```
[This
doc](https://docs.google.com/document/d/11b9ClCF2jYGU7vU2suOT7LRswYkg6tZ8_6xJbvxfh2I/edit#)
describes how the decorrelation rewrite works for UNION and the code changes
for it.
The other set operators INTERSECT and EXCEPT should work with the same
logic, I will add those in a follow-up after this PR is ready.
In this PR, we always add DomainJoins for correlation through UNIONs, and
never do direct substitution of the outer refs. That can also be added as an
optimization in a follow-up - it only affects performance, not surface area
coverage.
### Why are the changes needed?
To improve subquery support in Spark.
### Does this PR introduce _any_ user-facing change?
Before this change, queries like this would return an error like:
`Decorrelate inner query through Union is not supported.`
After this PR, this query can run successfully.
### How was this patch tested?
Unit tests and SQL query tests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]