Github user frreiss commented on the pull request: https://github.com/apache/spark/pull/13155#issuecomment-220684859 I've added additional changes to cover two additional cases that @hvanhovell pointed out on review, plus one additional case that came up while fixing the first two: - The correlated subquery may have a HAVING clause - The correlated subquery may be nested inside additional query blocks that apply projections - The correlated subquery may return NULL when the correlation bindings join with the subquery and a non-NULL value when bindings do *not* join, e.g. ```sql select l.a from l where (select case when count(*) = 1 then null else count(*) end from r where l.a = r.c) = 0 ``` The logic for statically evaluating the subquery's aggregate expression now handles an Aggregate node with a chain of Filter and Project operators above it. The rewrite logic has an third case to handle subqueries with HAVING clauses. The second case from the original change set now adds an additional column to cover subqueries that return NULL with bindings that join the inner query block. I also addressed various other more minor review comments. After these changes, the algorithm for preventing the COUNT bug in scalar subqueries is now the following: ``` V <- <value that the subquery returns when zero tuples join> if (V is null) then Use the original rewrite from SPARK-14785. else if (subquery has a Filter above the Aggregate node) then Rewrite the Filter node to a Project that adds a Boolean column isFiltered. Rewrite nodes above the Filter node so they pass through the isFiltered column. else Add an isFiltered column with a hard-coded value of false to the top Project in the subquery. endif Create a left outer join between the outer query block and the rewritten subquery. Put the following case statement into the Project operator above the outer join. case when isFiltered is null then coalesce(aggVal, V) when isFiltered then null else aggVal endif ``` ### Correctness proof Let *b* denote a tuple of correlation bindings from the outer query block. Without loss of generality, assume that the correlated subquery has the plan `Project(Filter(Aggregate(Join({b},T))))`, where *T* is a table. Note that upstream checks in Catalyst ensure that the Aggregate node will produce exactly one tuple. Consider the evaluation of the original subquery on *b*. We can ask three questions about this evaluation: - Did *b* join with one or more tuples from *T*? - Did the Filter node reject the tuple that the Aggregate node returned? - Did the subquery return `null`? The answer to each of these questions must be "yes" or "no", leading to eight cases: Case \# | Empty join? | Agg filtered? | SQ returns null? ------ | ----------- | ------------- | ---------------- 1 | Yes | Yes | Yes 2 | Yes | Yes | No 3 | Yes | No | Yes 4 | Yes | No | No 5 | No | Yes | Yes 6 | No | Yes | No 7 | No | No | Yes 8 | No | No | No Cases 2 and 6 are impossible. For the remaining cases, the rewritten subquery returns the correct result: - Case 1 and 3 => Subquery returns null on empty join result, so first branch of rewrite algorithm above applies. Outer join returns null for subquery result. - Case 4 => Subquery returns non-null answer on empty join result, so second or third branch of rewrite applies. Outer join in the rewritten query returns a tuple with `(aggVal, isFiltered)` set to `(null,null)`, so case statement at the top of the rewritten query returns the answer the subquery returns on an empty join, which is the correct answer. - Case 5 => If first branch of rewrite applies, the outer join in the rewritten query returns null. If second or third branch of rewrite applies, then outer join in rewritten query returns a tuple with `(aggVal, isFiltered)` set to `(null, true)`, so case statement at top of rewritten query returns null. - Case 7 => If first branch of rewrite applies, the outer join in the rewritten query returns null. For second or third branch of rewrite, outer join in rewritten query returns a tuple with `(aggVal, isFiltered)` set to `(null, false)`, so case statement at top of rewritten query returns null. - Case 8 => If first branch of rewrite applies, the outer join in the rewritten query returns null. For second or third branch of rewrite, outer join in rewritten query returns a tuple with `(aggVal, isFiltered)` set to `(<non-null value>, false)`, so case statement at top of rewritten query returns the non-null value of aggVal.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org