bersprockets opened a new pull request, #38440:
URL: https://github.com/apache/spark/pull/38440
### What changes were proposed in this pull request?
When creating the project list for the new projection in `ExtractGenerator`,
take into account whether the generator is outer when setting nullable on
generator-related output attributes.
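The idea can be sketched with a small self-contained model. This is not the patch itself or Catalyst's API: `Attr` and `generator_project_list` are hypothetical names used only for illustration, and the rule shown (an attribute is nullable if the generator is outer or if it was already nullable) is my reading of the description above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Attr:
    """Toy stand-in for an output attribute (hypothetical, not Spark's class)."""
    name: str
    nullable: bool


def generator_project_list(generator_output, outer):
    """Return the generator's attributes as they should appear in the projection.

    An outer generator emits NULLs when its input is null or empty, so every
    attribute it produces is treated as nullable.
    """
    return [replace(a, nullable=outer or a.nullable) for a in generator_output]


# Element attribute of an array<int> whose elements are never null.
c3 = Attr("c3", nullable=False)

print(generator_project_list([c3], outer=False))  # nullability unchanged
print(generator_project_list([c3], outer=True))   # forced to nullable
```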
### Why are the changes needed?
This PR fixes an issue that can produce either incorrect results or a
`NullPointerException`. It's a bit of an obscure issue in that I am
hard-pressed to reproduce it without using a subquery that has an inline table.
Example:
```
select c1, explode(c4) as c5 from (
  select c1, array(c3) as c4 from (
    select c1, explode_outer(c2) as c3
    from values
      (1, array(1, 2)),
      (2, array(2, 3)),
      (3, null)
    as data(c1, c2)
  )
);

+---+---+
|c1 |c5 |
+---+---+
|1  |1  |
|1  |2  |
|2  |2  |
|2  |3  |
|3  |0  |
+---+---+
```
In the last row, `c5` is 0, but should be `NULL`.
Another example:
```
select c1, exists(c4, x -> x is null) as c5 from (
  select c1, array(c3) as c4 from (
    select c1, explode_outer(c2) as c3
    from values
      (1, array(1, 2)),
      (2, array(2, 3)),
      (3, null)
    as data(c1, c2)
  )
);

+---+-----+
|c1 |c5   |
+---+-----+
|1  |false|
|1  |false|
|2  |false|
|2  |false|
|3  |false|
+---+-----+
```
In the last row, `false` should be `true`.
In both cases, at the time `CreateArray(c3)` is instantiated, `c3`'s
nullability is incorrect because the new projection created by
`ExtractGenerator` uses `generatorOutput` from `explode_outer(c2)` as a
projection list. `generatorOutput` doesn't take into account that
`explode_outer(c2)` is an _outer_ explode, so the nullability setting is lost.
`UpdateAttributeNullability` will eventually fix the nullable setting for
attributes referring to `c3`, but it doesn't fix the `containsNull` setting for
`c4` in `explode(c4)` (from the first example) or `exists(c4, x -> x is null)`
(from the second example).
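The stale `containsNull` can be illustrated with a deliberately simplified model. Everything here is hypothetical (`Attr`, `ArrayType`, `create_array`, and `update_attribute_nullability` are toy stand-ins, not Catalyst classes or rules); the only assumptions taken from the description above are that the array's `containsNull` is derived from the element's nullability when the array expression is built, and that the later pass corrects `nullable` without touching the data type.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArrayType:
    contains_null: bool


@dataclass(frozen=True)
class Attr:
    name: str
    nullable: bool
    data_type: object = "int"


def create_array(child):
    # containsNull is taken from the child's nullability at construction time.
    return ArrayType(contains_null=child.nullable)


def update_attribute_nullability(attr, nullable):
    # Toy version of the later pass: it corrects `nullable` only and
    # leaves the attribute's data type exactly as it was.
    return replace(attr, nullable=nullable)


# c3 as exposed by the buggy projection: non-nullable despite explode_outer.
c3 = Attr("c3", nullable=False)

# c4 = array(c3): the array type is fixed while c3 still looks non-nullable.
c4 = Attr("c4", nullable=False, data_type=create_array(c3))

# The later pass repairs c3's nullability ...
c3_fixed = update_attribute_nullability(c3, nullable=True)

# ... but a reference to c4 keeps the array type captured earlier.
c4_after = update_attribute_nullability(c4, nullable=c4.nullable)

stale = c4_after.data_type.contains_null         # what explode(c4) sees
expected = create_array(c3_fixed).contains_null  # what it should see
print(stale, expected)  # prints: False True
```

In this model, a consumer that trusts `contains_null=False` skips null checks on the elements, which is consistent with the wrong results shown above.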
This example fails with a `NullPointerException`:
```
select c1, inline_outer(c4) from (
  select c1, array(c3) as c4 from (
    select c1, explode_outer(c2) as c3
    from values
      (1, array(named_struct('a', 1, 'b', 2))),
      (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
      (3, null)
    as data(c1, c2)
  )
);

22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit test.