Github user frreiss commented on the pull request:

    https://github.com/apache/spark/pull/13155#issuecomment-220684859
  
    I've added additional changes to cover two additional cases that 
@hvanhovell pointed out on review, plus one additional case that came up while 
fixing the first two:
    - The correlated subquery may have a HAVING clause
    - The correlated subquery may be nested inside additional query blocks that 
apply projections
    - The correlated subquery may return NULL when the correlation bindings join
      with the subquery and a non-NULL value when bindings do *not* join, e.g. 
    ```sql
    select l.a from l where
      (select case when count(*) = 1 then null else count(*) end from r where 
l.a = r.c) = 0
    ```
    
    The logic for statically evaluating the subquery's aggregate expression now 
handles an Aggregate node with a chain of Filter and Project operators above 
it. The rewrite logic has an third case to handle subqueries with HAVING 
clauses. The second case from the original change set now adds an additional 
column to cover subqueries that return NULL with bindings that join the inner 
query block. I also addressed various other more minor review comments.
    
    After these changes, the algorithm for preventing the COUNT bug in scalar 
subqueries is now the following:
    ```
    V <- <value that the subquery returns when zero tuples join>
    if (V is null)
    then
      Use the original rewrite from SPARK-14785.
    else 
      if (subquery has a Filter above the Aggregate node)
      then
        Rewrite the Filter node to a Project that adds a Boolean column 
isFiltered.
        Rewrite nodes above the Filter node so they pass through the isFiltered 
column.
      else
        Add an isFiltered column with a hard-coded value of false to the top 
Project in the subquery.
      endif
      Create a left outer join between the outer query block and the rewritten 
subquery.
      Put the following case statement into the Project operator above the 
outer join.
          case 
           when isFiltered is null then coalesce(aggVal, V)
           when isFiltered then null
           else aggVal
    endif
    ```
    
    ### Correctness proof 
    
    Let *b* denote a tuple of correlation bindings from the outer query block.  
Without loss of generality, assume that the correlated subquery has the plan 
`Project(Filter(Aggregate(Join({b},T))))`, where *T* is a table. Note that 
upstream checks in Catalyst ensure that the Aggregate node will produce exactly 
one tuple.
    
    Consider the evaluation of the original subquery on *b*. We can ask three 
questions about this evaluation:
    - Did *b* join with one or more tuples from *T*?
    - Did the Filter node reject the tuple that the Aggregate node returned?
    - Did the subquery return `null`?
    
    The answer to each of these questions must be "yes" or "no", leading to 
eight cases:
    
    Case \# | Empty join? | Agg filtered? | SQ returns null?
    ------ | ----------- | ------------- | ----------------
    1      | Yes         | Yes           | Yes
    2      | Yes         | Yes           | No
    3      | Yes         | No            | Yes
    4      | Yes         | No            | No
    5      | No          | Yes           | Yes
    6      | No          | Yes           | No
    7      | No          | No            | Yes
    8      | No          | No            | No
    
    Cases 2 and 6 are impossible. For the remaining cases, the rewritten 
subquery returns the correct result:
    - Case 1 and 3 => Subquery returns null on empty join result, so first 
branch of rewrite algorithm above applies. Outer join returns null for subquery 
result.
    - Case 4 => Subquery returns non-null answer on empty join result, so 
second or third branch of rewrite applies. Outer join in the rewritten query 
returns a tuple with `(aggVal, isFiltered)` set to `(null,null)`, so case 
statement at the top of the rewritten query returns the answer the subquery 
returns on an empty join, which is the correct answer.
    - Case 5 => If first branch of rewrite applies, the outer join in the 
rewritten query returns null. If second or third branch of rewrite applies, 
then outer join in rewritten query returns a tuple with `(aggVal, isFiltered)` 
set to `(null, true)`, so case statement at top of rewritten query returns null.
    - Case 7 => If first branch of rewrite applies, the outer join in the 
rewritten query returns null. For second or third branch of rewrite, outer join 
in rewritten query returns a tuple with `(aggVal, isFiltered)` set to `(null, 
false)`, so case statement at top of rewritten query returns null.
    - Case 8 => If first branch of rewrite applies, the outer join in the 
rewritten query returns null. For second or third branch of rewrite, outer join 
in rewritten query returns a tuple with `(aggVal, isFiltered)` set to 
`(<non-null value>, false)`, so case statement at top of rewritten query 
returns the non-null value of aggVal.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to