[PR] [SPARK-45580][SQL] Handle case where a nested subquery becomes an existence join [spark]

via GitHub Tue, 05 Dec 2023 15:35:01 -0800


bersprockets opened a new pull request, #44193:
URL: https://github.com/apache/spark/pull/44193


   ### What changes were proposed in this pull request?
   
   In `RewritePredicateSubquery`, prune existence flags from the final join 
when `rewriteExistentialExpr` returns an existence join. This change prunes the 
flags (attributes with the name "exists") by adding a `Project` node.
   
   For example:
   ```
   Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
   ```
   becomes
   ```
   Project [a#13]
   +- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
      :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
      :  :- LocalRelation [a#13]
      :  +- LocalRelation [col1#17]
      +- LocalRelation [c1#15]
   ```
   This change always adds the `Project` node, whether `rewriteExistentialExpr` 
returns an existence join or not. In the case when `rewriteExistentialExpr` 
does not return an existence join, `RemoveNoopOperators` will remove the 
unneeded `Project` node.
   
   ### Why are the changes needed?
   
   This query returns an extraneous boolean column when run in spark-sql:
   ```
   create or replace temp view t1(a) as values (1), (2), (3), (7);
   create or replace temp view t2(c1) as values (1), (2), (3);
   create or replace temp view t3(col1) as values (3), (9);
   
   select *
   from t1
   where exists (
     select c1
     from t2
     where a = c1
     or a in (select col1 from t3)
   );
   
   1    false
   2    false
   3    true
   ```
   (Note: the above query will not have the extraneous boolean column when run 
from the Dataset API. That is because the Dataset API truncates the rows based 
on the schema of the analyzed plan. The bug occurs during optimization).
   
   This query fails when run in either spark-sql or using the Dataset API:
   ```
   select (
     select *
     from t1
     where exists (
       select c1
       from t2
       where a = c1
       or a in (select col1 from t3)
     )
     limit 1
   )
   from range(1);
   
   java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
something went wrong in analysis
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, except for the removal of the extraneous boolean flag and the fix to the 
error condition.
   
   ### How was this patch tested?
   
   New unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-45580][SQL] Handle case where a nested subquery becomes an existence join [spark]

Reply via email to