allisonwang-db commented on pull request #32179: URL: https://github.com/apache/spark/pull/32179#issuecomment-820083789
> BTW, is it necessary to be a subquery with aggregation? From the fix, I cannot tell how aggregation affects it. SPARK-17348 provides some explanations on why aggregate is causing the issue. Basically, when a correlated predicate is pulled up, all attributes from the inner query will be added as GROUP BY columns. When the mapping is not one-to-one, for instance `a + b = outer(c)` in the example above, both `a` and `b` will be added as group by columns, and the `count(*)` will count the number rows for each combination of (a, b), instead of (a + b). https://github.com/apache/spark/blob/7ff9d2e3eec514962e891420dbb3961e85826612/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L878-L901 Pull up correlated predicates through Aggregate: https://github.com/apache/spark/blob/3e218ade9cf6becc5de8b20a4385e345021a509d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala#L258-L264 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
