[GitHub] [spark] bersprockets opened a new pull request, #41809: [SPARK-44251][SQL] Set nullable correctly on coalesced join key in full outer USING join

via GitHub Fri, 30 Jun 2023 10:14:47 -0700


bersprockets opened a new pull request, #41809:
URL: https://github.com/apache/spark/pull/41809


   ### What changes were proposed in this pull request?
   
   For full outer joins employing USING, set the nullability of the coalesced join columns to true.
   
   ### Why are the changes needed?
   
   The following query produces incorrect results:
   ```
   create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2);
   create or replace temp view v2 as values (2, 3) as (c1, c2);
   
   select explode(array(c1)) as x
   from v1
   full outer join v2
   using (c1);
   
   -1   <== should be null
   1
   2
   ```
   The following query fails with a `NullPointerException`:
   ```
   create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2);
   create or replace temp view v2 as values ('2', 3) as (c1, c2);
   
   select explode(array(c1)) as x
   from v1
   full outer join v2
   using (c1);
   
   23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11)
   java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
   ...
   ```
   The above full outer joins implicitly add an aliased coalesce to the parent projection of the join: `coalesce(v1.c1, v2.c1) as c1`. When only one side's key is nullable, the coalesce's nullability is false, so the generator's output also has nullable set to false. But this is incorrect: if one side has a row with an explicit null key value, that row matches nothing on the other side, so the other side's key is also null (because the other side's row is "made up" of nulls), and both the `coalesce` and the `explode` return a null value.
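
The mismatch between declared and actual nullability can be reproduced without Spark. The following is a minimal sketch in plain Python, not Spark code: `coalesce_nullable` mirrors the rule that a coalesce is nullable only when all of its children are nullable, and `full_outer_join_keys` is a hypothetical helper that pairs up join keys the way a full outer join does (null keys never match). The data is the `c1` column of `v1` and `v2` from the first query.

```python
from typing import List, Optional, Tuple

Key = Optional[int]


def coalesce_nullable(children_nullable: List[bool]) -> bool:
    """Static rule: a coalesce can only be null if every child can be null."""
    return all(children_nullable)


def coalesce(*values: Key) -> Key:
    """Runtime behaviour: first non-null argument, or None if all are null."""
    for value in values:
        if value is not None:
            return value
    return None


def full_outer_join_keys(left: List[Key], right: List[Key]) -> List[Tuple[Key, Key]]:
    """Pair up join keys as a full outer join would (hypothetical helper).

    A null key never equals anything, so such rows are emitted with the
    other side's key filled in as None.
    """
    rows: List[Tuple[Key, Key]] = []
    matched_right = set()
    for l_key in left:
        matched = False
        for i, r_key in enumerate(right):
            if l_key is not None and l_key == r_key:
                rows.append((l_key, r_key))
                matched_right.add(i)
                matched = True
        if not matched:
            rows.append((l_key, None))  # right side is "made up" of nulls
    for i, r_key in enumerate(right):
        if i not in matched_right:
            rows.append((None, r_key))  # left side is "made up" of nulls
    return rows


v1_c1: List[Key] = [1, None]  # nullable: contains an explicit null
v2_c1: List[Key] = [2]        # not nullable

# Nullability computed from the join inputs: one child is non-nullable.
declared = coalesce_nullable([True, False])

# What the coalesce actually produces over the joined rows.
pairs = full_outer_join_keys(v1_c1, v2_c1)
actual = [coalesce(l_key, r_key) for l_key, r_key in pairs]

# After a full outer join both sides can be null, so the correct answer is True.
fixed = coalesce_nullable([True, True])
```

Here `declared` is `False` even though `actual` contains `None` for the row whose left key is null. That is the inconsistency that leads to the `-1` and the `NullPointerException` above.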
   
   While `UpdateNullability` actually repairs the nullability of the `coalesce` before execution, it doesn't recreate the generator output, so the nullability remains incorrect in `Generate#output`.
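
Why a late repair does not help can be illustrated with a toy model. This is a sketch under stated assumptions, not Spark's actual classes: `ToyGenerate`, `build_generate`, and `repair_input_only` are hypothetical names. The one assumption carried over from the description is that the generator's output attributes are fixed when the node is built, while a later rule only corrects the input expression.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Attr:
    name: str
    nullable: bool


@dataclass(frozen=True)
class ToyGenerate:
    # Nullability of the expression fed to explode(array(...)).
    element_nullable: bool
    # Output attributes, captured once when the node is built.
    output: Tuple[Attr, ...]


def build_generate(element_nullable: bool) -> ToyGenerate:
    """Build the node; the output nullability is copied from the input here."""
    return ToyGenerate(element_nullable, (Attr("x", element_nullable),))


def repair_input_only(node: ToyGenerate) -> ToyGenerate:
    """Hypothetical stand-in for a rule that fixes the input expression's
    nullability but does not rebuild the captured output attributes."""
    return replace(node, element_nullable=True)


# The node is built while the coalesce is still (wrongly) non-nullable.
analyzed = build_generate(element_nullable=False)

# A later rule corrects the input, but the output stays as it was captured.
optimized = repair_input_only(analyzed)
stale_output_nullable = optimized.output[0].nullable
```

In this model `optimized.element_nullable` is `True` while `stale_output_nullable` is still `False`, which is why this change sets the nullability correctly up front, on the coalesced join key, rather than relying on a later repair.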
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test.
   


