bersprockets opened a new pull request, #44193:
URL: https://github.com/apache/spark/pull/44193
### What changes were proposed in this pull request?
In `RewritePredicateSubquery`, prune existence flags from the final join
when `rewriteExistentialExpr` returns an existence join. This change prunes the
flags (attributes with the name "exists") by adding a `Project` node.
For example:
```
Join LeftSemi, ((a#13 = c1#15) OR exists#19)
:- Join ExistenceJoin(exists#19), (a#13 = col1#17)
:  :- LocalRelation [a#13]
:  +- LocalRelation [col1#17]
+- LocalRelation [c1#15]
```
becomes
```
Project [a#13]
+- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
```
This change always adds the `Project` node, whether `rewriteExistentialExpr`
returns an existence join or not. In the case when `rewriteExistentialExpr`
does not return an existence join, `RemoveNoopOperators` will remove the
unneeded `Project` node.
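
The behavior described above can be sketched with a small toy model. This is not Spark code: the classes `Relation`, `Join`, and `Project` and the helpers `prune_existence_flags` and `remove_noop_project` are hypothetical stand-ins written in Python for illustration. The sketch only assumes that an existence join outputs its left side's attributes plus the flag, and that a left semi join outputs its left side's attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    attrs: tuple

    def output(self):
        return list(self.attrs)


@dataclass(frozen=True)
class Join:
    left: object
    right: object
    join_type: str                      # "LeftSemi" or "ExistenceJoin"
    exists_attr: Optional[str] = None   # only set for an existence join

    def output(self):
        if self.join_type == "ExistenceJoin":
            # Left-side attributes plus the boolean existence flag.
            return self.left.output() + [self.exists_attr]
        # LeftSemi: only the left-side attributes survive.
        return self.left.output()


@dataclass(frozen=True)
class Project:
    project_list: tuple
    child: object

    def output(self):
        return list(self.project_list)


def prune_existence_flags(new_plan, original_output):
    # Always wrap the rewritten plan, whether or not a flag leaked.
    return Project(tuple(original_output), new_plan)


def remove_noop_project(plan):
    # Toy analogue of dropping a Project that does not change the output.
    if isinstance(plan, Project) and plan.output() == plan.child.output():
        return plan.child
    return plan


t1 = Relation(("a#13",))
t2 = Relation(("c1#15",))
t3 = Relation(("col1#17",))

# Case 1: the rewrite produced an existence join, so the flag leaks.
existence = Join(t1, t3, "ExistenceJoin", "exists#19")
semi = Join(existence, t2, "LeftSemi")
fixed = prune_existence_flags(semi, t1.output())
print(semi.output())    # ['a#13', 'exists#19']
print(fixed.output())   # ['a#13']

# Case 2: no existence join, so the added Project is a no-op and is dropped.
plain = Join(t1, t2, "LeftSemi")
wrapped = prune_existence_flags(plain, t1.output())
print(remove_noop_project(wrapped) == plain)  # True
```

In the first case the `Project` survives because it actually narrows the output; in the second it is removed, which is why always adding it costs nothing in the final plan.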
### Why are the changes needed?
This query returns an extraneous boolean column when run in spark-sql:
```
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
1 false
2 false
3 true
```
(Note: the above query does not return the extraneous boolean column when run
from the Dataset API, because the Dataset API truncates rows based on the
schema of the analyzed plan, whereas the bug occurs during optimization.)
This query fails when run in either spark-sql or using the Dataset API:
```
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);
java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
```
### Does this PR introduce _any_ user-facing change?
No, except for the removal of the extraneous boolean flag and the fix to the
error condition.
### How was this patch tested?
New unit test.
### Was this patch authored or co-authored using generative AI tooling?
No.