[GitHub] [spark] allisonwang-db commented on pull request #32179: [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when a subquery is aggregated

GitBox Wed, 14 Apr 2021 21:21:53 -0700


allisonwang-db commented on pull request #32179:
URL: https://github.com/apache/spark/pull/32179#issuecomment-820083789



   > BTW, is it necessary to be a subquery with aggregation? From the fix, I 
cannot tell how aggregation affects it.
   
   SPARK-17348 provides some explanations on why aggregate is causing the 
issue. Basically, when a correlated predicate is pulled up, all attributes from 
the inner query will be added as GROUP BY columns. When the mapping is not 
one-to-one, for instance `a + b = outer(c)` in the example above, both `a` and 
`b` will be added as group by columns, and the `count(*)` will count the number 
rows for each combination of (a, b), instead of (a + b).
   
https://github.com/apache/spark/blob/7ff9d2e3eec514962e891420dbb3961e85826612/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L878-L901
   
   Pull up correlated predicates through Aggregate:
   
https://github.com/apache/spark/blob/3e218ade9cf6becc5de8b20a4385e345021a509d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala#L258-L264


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] allisonwang-db commented on pull request #32179: [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when a subquery is aggregated

Reply via email to