jchen5 opened a new pull request, #39759:
URL: https://github.com/apache/spark/pull/39759

   ## What changes were proposed in this pull request?
   
   Adds support for subquery decorrelation with INTERSECT and EXCEPT operators 
on the correlation paths. For example:
   ```
   SELECT t1a, (
     SELECT avg(b) FROM (
       SELECT t2b as b FROM t2 WHERE t2a = t1a
       INTERSECT
       SELECT t3b as b FROM t3 WHERE t3a = t1a
   ))
   FROM t1
   ```
   
   This uses the same logic as for UNION decorrelation added in 
https://github.com/apache/spark/pull/39375.
   
   [This doc](https://docs.google.com/document/d/11b9ClCF2jYGU7vU2suOT7LRswYkg6tZ8_6xJbvxfh2I/edit#) describes how the decorrelation rewrite works for set operations and the code changes for it.
   
   In this PR, we always add DomainJoins for correlation through INTERSECT/EXCEPT and never directly substitute the outer references. Direct substitution can be added as an optimization in a follow-up; it only affects performance, not which queries are supported.
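
A minimal sketch of the idea behind the DomainJoin approach, run on SQLite through Python's `sqlite3` module rather than on Spark. The table contents are hypothetical, and the hand-written "decorrelated" query only approximates the shape of the rewrite (each INTERSECT branch is joined with the distinct outer values, then the result is joined back to the outer table). It is not the plan Spark actually produces.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the example query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (t1a INTEGER);
    CREATE TABLE t2 (t2a INTEGER, t2b INTEGER);
    CREATE TABLE t3 (t3a INTEGER, t3b INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1, 10), (1, 20), (2, 30);
    INSERT INTO t3 VALUES (1, 10), (1, 20), (1, 40), (2, 50);
""")

# Correlated form: both INTERSECT branches reference the outer column t1a.
correlated = conn.execute("""
    SELECT t1a, (
      SELECT avg(b) FROM (
        SELECT t2b AS b FROM t2 WHERE t2a = t1a
        INTERSECT
        SELECT t3b AS b FROM t3 WHERE t3a = t1a
      ))
    FROM t1
    ORDER BY t1a
""").fetchall()

# Hand-decorrelated form (an approximation of the DomainJoin idea):
# each branch is joined with the "domain" of distinct outer values, the
# domain key k is carried through the INTERSECT, and the aggregated result
# is left-joined back to t1 on that key.
decorrelated = conn.execute("""
    SELECT t1.t1a, s.avg_b
    FROM t1
    LEFT JOIN (
      SELECT k, avg(b) AS avg_b FROM (
        SELECT d.t1a AS k, t2b AS b
        FROM t2 JOIN (SELECT DISTINCT t1a FROM t1) AS d ON t2a = d.t1a
        INTERSECT
        SELECT d.t1a AS k, t3b AS b
        FROM t3 JOIN (SELECT DISTINCT t1a FROM t1) AS d ON t3a = d.t1a
      )
      GROUP BY k
    ) AS s ON s.k = t1.t1a
    ORDER BY t1.t1a
""").fetchall()

print(correlated)
print(decorrelated)
assert correlated == decorrelated
```

Carrying the domain key through both branches is what keeps the INTERSECT from mixing rows that belong to different outer values. Direct substitution of the outer references would avoid the extra joins with the domain, which is why the PR describes it as a performance optimization only.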
   
   ### Why are the changes needed?
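   To improve subquery support in Spark: correlated subqueries that use INTERSECT or EXCEPT on the correlation path were previously rejected.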
   
   ### Does this PR introduce _any_ user-facing change?
   Before this change, queries like the example above failed with an error such as: `Decorrelate inner query through Intersect is not supported.`
   
   After this PR, such queries run successfully.
   
   ### How was this patch tested?
   Unit tests and SQL query tests.
   
   Factors tested include the following (tests for some of these are still WIP):
   - Subquery type:
     - Eligible for DecorrelateInnerQuery: Scalar, lateral join
     - Not supported: EXISTS (new tests) and IN (existing tests)
   - UNION inside and outside subquery
   - Correlation in where, project, group by, aggregates, or no correlation
   - Project, Aggregate, Window under the Union
   - COUNT bug
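
For readers unfamiliar with the COUNT bug listed above, the sketch below illustrates it on SQLite through Python's `sqlite3` module, with hypothetical data. A correlated `count(*)` returns 0 for outer rows with no matches, whereas a naive join-based rewrite returns NULL for those rows. The `COALESCE` fix shown is one common remedy and is an assumption here, not a description of how Spark handles the case.

```python
import sqlite3

# Hypothetical sample data; t1a = 2 and t1a = 3 have an empty INTERSECT result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (t1a INTEGER);
    CREATE TABLE t2 (t2a INTEGER, t2b INTEGER);
    CREATE TABLE t3 (t3a INTEGER, t3b INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1, 10), (1, 20), (2, 30);
    INSERT INTO t3 VALUES (1, 10), (1, 20), (1, 40), (2, 50);
""")

# Correlated form: count(*) over an empty input yields 0, not NULL.
correlated = conn.execute("""
    SELECT t1a, (
      SELECT count(*) FROM (
        SELECT t2b AS b FROM t2 WHERE t2a = t1a
        INTERSECT
        SELECT t3b AS b FROM t3 WHERE t3a = t1a
      ))
    FROM t1
    ORDER BY t1a
""").fetchall()

# Shared join-based rewrite; {count_expr} selects how the count is read
# from the left-joined aggregate.
rewrite_template = """
    SELECT t1.t1a, {count_expr}
    FROM t1
    LEFT JOIN (
      SELECT k, count(*) AS cnt FROM (
        SELECT d.t1a AS k, t2b AS b
        FROM t2 JOIN (SELECT DISTINCT t1a FROM t1) AS d ON t2a = d.t1a
        INTERSECT
        SELECT d.t1a AS k, t3b AS b
        FROM t3 JOIN (SELECT DISTINCT t1a FROM t1) AS d ON t3a = d.t1a
      )
      GROUP BY k
    ) AS s ON s.k = t1.t1a
    ORDER BY t1.t1a
"""

# Naive rewrite: outer rows with no matching group get NULL from the left join.
naive = conn.execute(rewrite_template.format(count_expr="s.cnt")).fetchall()

# Fixed rewrite: replace the missing group's NULL with the value that
# count(*) produces on empty input.
fixed = conn.execute(
    rewrite_template.format(count_expr="COALESCE(s.cnt, 0)")
).fetchall()

print(correlated)
print(naive)
print(fixed)
assert naive != correlated
assert fixed == correlated
```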
   



