bersprockets opened a new pull request, #36883:
URL: https://github.com/apache/spark/pull/36883

   ### What changes were proposed in this pull request?
   
   Change `Inline#elementSchema` to make each struct field nullable when the 
containing array has a null element.
   
   ### Why are the changes needed?
   
   This query returns incorrect results (the last row should be `NULL NULL`):
   ```
   spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
   1    2
   -1   -1
   Time taken: 4.053 seconds, Fetched 2 row(s)
   spark-sql>
   ```
   And this query throws a `NullPointerException`:
   ```
   spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
   22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
   java.lang.NullPointerException: null
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source) ~[?:?]
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(Buffere
   ```
   When an array of structs is created by `CreateArray`, and no struct field 
contains a literal null value, the schema for the struct will have non-nullable 
fields, even if the array itself has a null entry (as in the example above). As 
a result, the output attributes for the generator will be non-nullable.
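
The schema adjustment can be sketched roughly as follows, in plain Python rather than Spark's Scala. `StructField`, `StructType`, `ArrayType`, and `element_schema` are simplified stand-ins that only borrow the names of Spark's types; they are assumptions for illustration, not the code in this PR.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class StructField:  # hypothetical stand-in, not Spark's class
    name: str
    data_type: str
    nullable: bool


@dataclass(frozen=True)
class StructType:  # hypothetical stand-in
    fields: Tuple[StructField, ...]


@dataclass(frozen=True)
class ArrayType:  # hypothetical stand-in
    element_type: StructType
    contains_null: bool


def element_schema(array_type: ArrayType) -> StructType:
    """Return the struct schema to use for the generator's output columns."""
    struct = array_type.element_type
    if not array_type.contains_null:
        return struct
    # A null array element produces a row of nulls, so every field must be nullable.
    return StructType(tuple(replace(f, nullable=True) for f in struct.fields))


# Type of array(named_struct('a', 1, 'b', 2), null) as described above:
# the struct fields are non-nullable, but the array may hold a null element.
arr = ArrayType(
    StructType((StructField("a", "int", False), StructField("b", "int", False))),
    contains_null=True,
)
fixed = element_schema(arr)
```

With `contains_null=False` the struct schema is returned unchanged, so the sketch only widens nullability when a null element is possible.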
   
   When the output attributes for `Inline` are non-nullable, 
`GenerateUnsafeProjection#writeExpressionsToBuffer` generates incorrect code 
for null structs.
   
   In more detail, the issue is this: `GenerateExec#codeGenCollection` 
generates code that will check if the struct instance (i.e., array element) is 
null and, if so, set a boolean for each struct field to indicate that the field 
contains a null. However, unless the generator's output attributes are 
nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` will not generate 
any code to check those booleans. Instead it will generate code to write out 
whatever is in the variables that normally hold the struct values (which will 
be garbage if the array element is null).
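
A toy model of that interaction, written in plain Python, may make the failure easier to see. The functions `generate_collection_vars` and `write_row` are hypothetical and only imitate the behavior described above; they are not Spark's generated code, and the `-1` placeholder is chosen only to echo the output of the first query.

```python
from typing import List, Optional, Tuple


def generate_collection_vars(element: Optional[Tuple[int, ...]], width: int):
    """Imitate the per-field (is_null, value) pairs produced for one array element."""
    if element is None:
        # Null struct: flag every field as null; the value slots hold a placeholder.
        return [(True, -1)] * width
    return [(False, v) for v in element]


def write_row(field_vars, nullable: List[bool]):
    """Imitate the projection: is_null is consulted only for nullable attributes."""
    row = []
    for (is_null, value), can_be_null in zip(field_vars, nullable):
        if can_be_null and is_null:
            row.append(None)
        else:
            # For a non-nullable attribute the flag is never checked.
            row.append(value)
    return row


elements = [(1, 2), None]
# Output attributes non-nullable (before the change) vs. nullable (after).
before = [write_row(generate_collection_vars(e, 2), [False, False]) for e in elements]
after = [write_row(generate_collection_vars(e, 2), [True, True]) for e in elements]
```

In this model, `before` carries the placeholder values through for the null element, while `after` yields a row of nulls.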
   
   Arrays of structs from file sources do not have this issue. In that case, 
each `StructField` will have `nullable=true` due to 
[this line in `DataSource.scala`](https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L417).
   
   (Note: the eval path for `Inline` has a different bug with null array 
elements that occurs even when `nullable` is set correctly in the schema, but I 
will address that in a separate PR.)
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test.
   



