jchen5 opened a new pull request, #40811: URL: https://github.com/apache/spark/pull/40811
### What changes were proposed in this pull request?

Fix a correctness bug for scalar subqueries with COUNT and a GROUP BY clause, for example:

```
create view t1(c1, c2) as values (0, 1), (1, 2);
create view t2(c1, c2) as values (0, 2), (0, 3);
select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from t1;

-- Correct answer: [(0, 1, 2), (1, 2, null)]
-- Output before this fix (the second row incorrectly returns 0):
+---+---+------------------+
|c1 |c2 |scalarsubquery(c1)|
+---+---+------------------+
|0  |1  |2                 |
|1  |2  |0                 |
+---+---+------------------+
```

This is due to a bug in our "COUNT bug" handling for scalar subqueries. For a subquery with a COUNT aggregate but no GROUP BY clause, 0 is the correct output on empty inputs, and we use the COUNT bug handling to construct a plan that yields 0 when there are no matched rows. When there is a GROUP BY clause, NULL is the correct output, because grouping an empty input produces no rows. However, we still construct the same plan as in the former case and therefore incorrectly output 0. Instead, we need to apply the COUNT bug handling only when the scalar subquery has no GROUP BY clause.

To fix this, we need to track whether the scalar subquery has a GROUP BY, i.e. a non-empty groupingExpressions for the Aggregate node. This needs to be checked before DecorrelateInnerQuery, because that rule adds the correlated outer references to the group-by list, so afterwards the group-by list is always non-empty. We save the result as a boolean in the ScalarSubquery node until later, when we rewrite the subquery into a join in constructLeftJoins.

### Why are the changes needed?

Fix a correctness bug.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Add SQL tests and unit tests. (Note that there were 2 existing unit tests for queries of this shape, which had the incorrect results as golden results.)
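For reviewers who want to check the expected semantics independently of Spark, here is a minimal sketch that runs the same query shape in SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in for standard SQL behavior here, not part of this patch. The helper `run` is hypothetical, the views are replaced with plain tables, and the GROUP BY column is qualified as `t2.c1` to avoid any ambiguity.

```python
import sqlite3

# Illustration only: SQLite stands in for Spark to show the expected SQL semantics.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(c1 INTEGER, c2 INTEGER);
    CREATE TABLE t2(c1 INTEGER, c2 INTEGER);
    INSERT INTO t1 VALUES (0, 1), (1, 2);
    INSERT INTO t2 VALUES (0, 2), (0, 3);
""")


def run(subquery_tail):
    """Run the correlated COUNT subquery with an optional GROUP BY tail (hypothetical helper)."""
    sql = f"""
        SELECT c1, c2,
               (SELECT count(*) FROM t2 WHERE t1.c1 = t2.c1 {subquery_tail})
        FROM t1
        ORDER BY c1
    """
    return conn.execute(sql).fetchall()


# No GROUP BY: COUNT over an empty input still yields one row containing 0.
no_group_by = run("")
# With GROUP BY: an empty input yields no groups, so the scalar subquery is NULL.
with_group_by = run("GROUP BY t2.c1")

print(no_group_by)    # [(0, 1, 2), (1, 2, 0)]
print(with_group_by)  # [(0, 1, 2), (1, 2, None)]
```

The only difference between the two calls is the GROUP BY clause, which is the distinction the COUNT bug handling needs to preserve.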
