[GitHub] [spark] bersprockets opened a new pull request, #39970: [SPARK-42401][SQL] Set `containsNull` correctly in the data type for array_insert/array_append

via GitHub Fri, 10 Feb 2023 15:36:09 -0800


bersprockets opened a new pull request, #39970:
URL: https://github.com/apache/spark/pull/39970


   ### What changes were proposed in this pull request?
   
   In the `DataType` instance returned by `ArrayInsert#dataType` and `ArrayAppend#dataType`, set `containsNull` to true if either of the following holds:
   
   - the input array has `containsNull` set to true, or
   - the expression to be inserted/appended is nullable.
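
The rule above can be sketched with a small stand-alone model. This is not Spark's Scala code: the `ArrayType` dataclass and the `result_type` helper are hypothetical names used only to illustrate how the result's `containsNull` flag is derived from the two inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    """Toy stand-in for an array data type (hypothetical, not Spark's class)."""
    element_type: str
    contains_null: bool


def result_type(input_type: ArrayType, item_nullable: bool) -> ArrayType:
    """Derive the result type of an insert/append under the proposed rule.

    The result may contain nulls if the input array may contain nulls
    or if the inserted/appended expression is nullable.
    """
    return ArrayType(
        element_type=input_type.element_type,
        contains_null=input_type.contains_null or item_nullable,
    )


# array(1, 2, 3, 4) has no nulls, but cast(null as int) is nullable.
print(result_type(ArrayType("int", False), item_nullable=True).contains_null)  # True
```

Only when both inputs are non-nullable does the result keep `contains_null` set to false.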
   
   ### Why are the changes needed?
   
   The following two queries return the wrong answer:
   ```
   spark-sql> select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
   [1,2,3,4,0] <== should be [1,2,3,4,null]
   Time taken: 3.879 seconds, Fetched 1 row(s)
   spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
   [1,2,3,4,0] <== should be [1,2,3,4,null]
   Time taken: 0.068 seconds, Fetched 1 row(s)
   spark-sql> 
   ```
   The following two queries throw a `NullPointerException`:
   ```
   spark-sql> select array_insert(array('1', '2', '3', '4'), 5, cast(null as string));
   23/02/10 11:24:59 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
   java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   ...
   spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as string));
   23/02/10 11:25:10 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
   java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   ...
   spark-sql> 
   ```
   The bug arises because both `ArrayInsert` and `ArrayAppend` use the first child's data type as the function's data type. That is, they use the first child's `containsNull` setting, regardless of whether the insert/append operation might produce an array containing a null value.
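
One plausible reading of the `[1,2,3,4,0]` symptom, offered as an assumption rather than something this description states: code that consumes the array trusts `containsNull` and skips null checks when it is false, so a null element surfaces as the primitive default. The toy model below illustrates that idea; `Slot` and `write_ints` are hypothetical names and do not correspond to Spark's writer classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Slot:
    """Toy element: a null flag plus a primitive value (0 when null)."""
    is_null: bool
    value: int


def write_ints(slots: List[Slot], contains_null: bool) -> List[Optional[int]]:
    """Toy writer that consults the null flag only if the type allows nulls."""
    out: List[Optional[int]] = []
    for slot in slots:
        if contains_null and slot.is_null:
            out.append(None)        # null-aware path preserves the null
        else:
            out.append(slot.value)  # null flag ignored: default value leaks out
    return out


# array(1, 2, 3, 4) with cast(null as int) appended.
slots = [Slot(False, 1), Slot(False, 2), Slot(False, 3), Slot(False, 4), Slot(True, 0)]

print(write_ints(slots, contains_null=False))  # [1, 2, 3, 4, 0]
print(write_ints(slots, contains_null=True))   # [1, 2, 3, 4, None]
```

Under the same assumption, skipping the check for a reference type such as a string would mean dereferencing a null value, which would be consistent with the `NullPointerException` shown above.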
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit tests.
   

