allisonwang-db opened a new pull request #33070:
URL: https://github.com/apache/spark/pull/33070


   ### What changes were proposed in this pull request?
   This PR modifies `DecorrelateInnerQuery` to handle the COUNT bug for lateral 
subqueries. Similar to SPARK-15370, rewriting lateral subqueries as joins can 
change the semantics of the subquery and lead to incorrect answers. 
   
   However we can't reuse the existing code to handle the count bug for 
correlated scalar subqueries because it assumes the subquery to have a specific 
shape (either with Filter + Aggregate or Aggregate as the root node). Instead, 
this PR proposes a more generic way to handle the COUNT bug. If an Aggregate is 
subject to the COUNT bug, we insert a left outer domain join between the outer 
query and the aggregate with a `alwaysTrue` marker and rewrite the final result 
conditioning on the marker. For example:
   
   ```sql
   -- t1: [(0, 1), (1, 2)]
   -- t2: [(0, 2), (0, 3)]
   select * from t1 left outer join lateral (select count(*) from t2 where 
t2.c1 = t1.c1)
   ```
   
   Without count bug handling, the query plan is
   ```
   Project [c1#44, c2#45, count(1)#53L]
   +- Join LeftOuter, (c1#48 = c1#44)
      :- LocalRelation [c1#44, c2#45]
      +- Aggregate [c1#48], [count(1) AS count(1)#53L, c1#48]
         +- LocalRelation [c1#48]
   ```
   and the answer is wrong: 
   ```
   +---+---+--------+
   |c1 |c2 |count(1)|
   +---+---+--------+
   |0  |1  |2       |
   |1  |2  |null    |
   +---+---+--------+
   ```
   
   With the count bug handling:
   ```
   Project [c1#1, c2#2, count(1)#10L]
   +- Join LeftOuter, (c1#34 <=> c1#1)
      :- LocalRelation [c1#1, c2#2]
      +- Project [if (isnull(alwaysTrue#32)) 0 else count(1)#33L AS 
count(1)#10L, c1#34]
         +- Join LeftOuter, (c1#5 = c1#34)
            :- Aggregate [c1#1], [c1#1 AS c1#34]
            :  +- LocalRelation [c1#1]
            +- Aggregate [c1#5], [count(1) AS count(1)#33L, c1#5, true AS 
alwaysTrue#32]
               +- LocalRelation [c1#5]
   ```
   and we have the correct answer:
   ```
   +---+---+--------+
   |c1 |c2 |count(1)|
   +---+---+--------+
   |0  |1  |2       |
   |1  |2  |0       |
   +---+---+--------+
   ```
   
   ### Why are the changes needed?
   Fix a correctness bug with lateral join rewrite.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added SQL query tests. The results are consistent with Postgres' results.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to