bersprockets opened a new pull request, #41809:
URL: https://github.com/apache/spark/pull/41809
### What changes were proposed in this pull request?
For full outer joins employing USING, set the nullability of the coalesced
join columns to true.
### Why are the changes needed?
The following query produces incorrect results:
```
create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values (2, 3) as (c1, c2);
select explode(array(c1)) as x
from v1
full outer join v2
using (c1);
-1 <== should be null
1
2
```
The following query fails with a `NullPointerException`:
```
create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
create or replace temp view v2 as values ('2', 3) as (c1, c2);
select explode(array(c1)) as x
from v1
full outer join v2
using (c1);
23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
...
```
The above full outer joins implicitly add an aliased coalesce to the parent
projection of the join: `coalesce(v1.c1, v2.c1) as c1`. When only one side's
key is nullable, the coalesce is marked as non-nullable, so the generator's
output is also marked as non-nullable. This is incorrect: if one side has a row
with an explicit null key, that row matches nothing, so the other side's row is
"made up" by the join and its key is null as well. Both keys are then null, and
both the `coalesce` and the `explode` return a null value.
While `UpdateNullability` actually repairs the nullability of the `coalesce`
before execution, it doesn't recreate the generator output, so the nullability
remains incorrect in `Generate#output`.
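
The mismatch described above can be reproduced outside Spark with a small model. The following is a minimal sketch in plain Python, not Spark code: `full_outer_join_keys`, `coalesce`, and `coalesce_nullable` are hypothetical helpers written for illustration, and `coalesce_nullable` assumes the usual rule that a coalesce can be null only when every child can be null. It uses the key values from the first query (`v1.c1` is `1` and `null`, `v2.c1` is `2`).

```python
from typing import List, Optional, Tuple


def full_outer_join_keys(
    left: List[Optional[int]], right: List[Optional[int]]
) -> List[Tuple[Optional[int], Optional[int]]]:
    """Toy full outer equi-join on a single key; None (NULL) never matches."""
    rows = []
    matched_right = set()
    for l in left:
        hit = False
        for i, r in enumerate(right):
            if l is not None and r is not None and l == r:
                rows.append((l, r))
                matched_right.add(i)
                hit = True
        if not hit:
            # Unmatched left row: the right side is "made up" as NULL.
            rows.append((l, None))
    for i, r in enumerate(right):
        if i not in matched_right:
            # Unmatched right row: the left side is "made up" as NULL.
            rows.append((None, r))
    return rows


def coalesce(*values: Optional[int]) -> Optional[int]:
    """Return the first non-NULL value, or None if all values are NULL."""
    return next((v for v in values if v is not None), None)


def coalesce_nullable(children_nullable: List[bool]) -> bool:
    """Assumed rule: coalesce is nullable only if every child is nullable."""
    return all(children_nullable)


v1_keys = [1, None]  # v1.c1 is nullable
v2_keys = [2]        # v2.c1 is non-nullable

joined = full_outer_join_keys(v1_keys, v2_keys)
coalesced = [coalesce(l, r) for l, r in joined]

# Nullability computed from the pre-join children: one side is non-nullable.
declared_nullable = coalesce_nullable([True, False])
# What the data actually contains after the join.
actually_null = any(v is None for v in coalesced)

print(coalesced)
print("declared nullable:", declared_nullable, "| contains null:", actually_null)
```

In this model the declared nullability is false while the joined data still contains a null, which is the inconsistency the PR removes by marking the coalesced join column as nullable for full outer joins.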
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit test.