[GitHub] [spark] bersprockets opened a new pull request, #38440: [SPARK-40963][SQL] Set nullable correctly in project created by `ExtractGenerator`

GitBox Sun, 30 Oct 2022 17:12:25 -0700


bersprockets opened a new pull request, #38440:
URL: https://github.com/apache/spark/pull/38440


   ### What changes were proposed in this pull request?
   
   When creating the project list for the new projection in `ExtractGenerator`,
take into account whether the generator is outer when setting nullable on 
generator-related output attributes.
   
   ### Why are the changes needed?
   
   This PR fixes an issue that can produce either incorrect results or a 
`NullPointerException`. It's a bit of an obscure issue in that I am 
hard-pressed to reproduce it without using a subquery that has an inline table.
   
   Example:
   ```
   select c1, explode(c4) as c5 from (
     select c1, array(c3) as c4 from (
       select c1, explode_outer(c2) as c3
       from values
       (1, array(1, 2)),
       (2, array(2, 3)),
       (3, null)
       as data(c1, c2)
     )
   );
   
   +---+---+
   |c1 |c5 |
   +---+---+
   |1  |1  |
   |1  |2  |
   |2  |2  |
   |2  |3  |
   |3  |0  |
   +---+---+
   ```
   In the last row, `c5` is 0, but should be `NULL`.
   
   Another example:
   ```
   select c1, exists(c4, x -> x is null) as c5 from (
     select c1, array(c3) as c4 from (
       select c1, explode_outer(c2) as c3
       from values
       (1, array(1, 2)),
       (2, array(2, 3)),
       (3, null)
       as data(c1, c2)
     )
   );
   
   +---+-----+
   |c1 |c5   |
   +---+-----+
   |1  |false|
   |1  |false|
   |2  |false|
   |2  |false|
   |3  |false|
   +---+-----+
   ```
   In the last row, `false` should be `true`.
   
   In both cases, at the time `CreateArray(c3)` is instantiated, `c3`'s 
nullability is incorrect because the new projection created by 
`ExtractGenerator` uses `generatorOutput` from `explode_outer(c2)` as a 
projection list. `generatorOutput` doesn't take into account that 
`explode_outer(c2)` is an _outer_ explode, so the nullability setting is lost.
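
The rule described above can be sketched with a toy model. This is not Spark code and not the actual patch: `Attribute` and `project_list` are hypothetical names introduced only for illustration, and the sketch assumes the intended rule is that every generator output becomes nullable when the generator is outer.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Attribute:
    """Toy stand-in for an output attribute (hypothetical, not Spark's class)."""
    name: str
    nullable: bool


def project_list(generator_output: List[Attribute], outer: bool) -> List[Attribute]:
    """Build the generator-related part of a projection list.

    An outer generator emits a row of nulls when its input is null or empty,
    so each generator output is treated as nullable whenever `outer` is True.
    """
    return [replace(a, nullable=a.nullable or outer) for a in generator_output]


# The generator itself reports c3 as non-nullable (array elements are never null).
generator_output = [Attribute("c3", nullable=False)]

inner = project_list(generator_output, outer=False)
outer = project_list(generator_output, outer=True)

print(inner[0].nullable, outer[0].nullable)
```

Reusing `generator_output` unchanged as the projection list corresponds to the `outer=False` path in this sketch, which is how the nullability of `c3` gets lost for `explode_outer`.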
   
   `UpdateAttributeNullability` will eventually fix the nullable setting for 
attributes referring to `c3`, but it doesn't fix the `containsNull` setting for 
`c4` in `explode(c4)` (from the first example) or `exists(c4, x -> x is null)` 
(from the second example).
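
Why a later correction of `c3` does not help can be illustrated with another toy model. Everything here is an assumption made for illustration: `Attribute`, `ArrayType`, `array_type_of`, and `read_elements` are hypothetical names, and the consumer that substitutes `0` for a missing value only mimics the symptom in the first example rather than reproducing Spark's generated code.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Attribute:
    name: str
    nullable: bool


@dataclass(frozen=True)
class ArrayType:
    contains_null: bool


def array_type_of(child: Attribute) -> ArrayType:
    # Snapshot of the child's nullability at the moment the array type is derived.
    return ArrayType(contains_null=child.nullable)


def read_elements(values: List[Optional[int]], array_type: ArrayType) -> List[Optional[int]]:
    # A consumer that trusts the type: when contains_null is False it skips the
    # null check, so a missing value surfaces as the primitive default 0.
    if array_type.contains_null:
        return list(values)
    return [0 if v is None else v for v in values]


c3 = Attribute("c3", nullable=False)   # wrong: c3 comes from an outer generator
c4_type = array_type_of(c3)            # contains_null=False is captured here
c3 = replace(c3, nullable=True)        # a later correction reaches only the attribute

print(c3.nullable, c4_type.contains_null)
print(read_elements([None], c4_type))
print(read_elements([None], array_type_of(c3)))
```

In this sketch the stale `c4_type` still claims the array holds no nulls after `c3` has been corrected, so the null element is read as `0`. Deriving the type from the corrected attribute preserves the null.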
   
   This example fails with a `NullPointerException`:
   ```
   select c1, inline_outer(c4) from (
     select c1, array(c3) as c4 from (
       select c1, explode_outer(c2) as c3
       from values
       (1, array(named_struct('a', 1, 'b', 2))),
       (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
       (3, null)
       as data(c1, c2)
     )
   );
   
   22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
   java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   
   New unit test.
   


