Jefffrey opened a new issue, #19682: URL: https://github.com/apache/datafusion/issues/19682
Current behaviour is per this issue comment: https://github.com/apache/datafusion/issues/8181#issuecomment-1817864491 > for `array_union`: > ``` > array_union([], []) = [] > array_union([], null) = [] > array_union(null, []) = [] > array_union(null, null) = null > ``` > > for `array_except`: > ``` > array_except([], []) = [] > array_except([], null) = [] > array_except(null, []) = null > array_except(null, null) = null > ``` > > for `array_intersect`: > ``` > array_intersect([], []) = [] > array_intersect([], null) = [] > array_intersect(null, []) = [] > array_intersect(null, null) = null > ``` I find this confusing as usually with other functions we return null if either input is null. And recent PR: - https://github.com/apache/datafusion/pull/19415 Now implements that for array inputs (for union and intersect) we return null in a row if either input list for the row is null (scalar null inputs still correspond to above behaviour as that was a separate code path). ## Spark behaviour Consistently return null if either input is null: ```python >>> spark.sql("select array_except(cast(null as array<int>), array(1))").show() +----------------------------+ |array_except(NULL, array(1))| +----------------------------+ | NULL| +----------------------------+ >>> spark.sql("select array_except(array(1), cast(null as array<int>))").show() +----------------------------+ |array_except(array(1), NULL)| +----------------------------+ | NULL| +----------------------------+ >>> spark.sql("select array_union(cast(null as array<int>), array(1))").show() +---------------------------+ |array_union(NULL, array(1))| +---------------------------+ | NULL| +---------------------------+ >>> spark.sql("select array_union(array(1), cast(null as array<int>))").show() +---------------------------+ |array_union(array(1), NULL)| +---------------------------+ | NULL| +---------------------------+ >>> spark.sql("select array_intersect(cast(null as array<int>), array(1))").show() +-------------------------------+ |array_intersect(NULL, array(1))| +-------------------------------+ | NULL| +-------------------------------+ >>> spark.sql("select array_intersect(array(1), cast(null as array<int>))").show() +-------------------------------+ |array_intersect(array(1), NULL)| +-------------------------------+ | NULL| +-------------------------------+ ``` ## Clickhouse behaviour Consistently return null if either input is null: ```sql :) select arrayExcept(null, [1]); SELECT arrayExcept(NULL, [1]) Query id: ca4fab63-30c0-4e45-8358-b9d23e495357 ┌─arrayExcept(NULL, [1])─┐ 1. │ ᴺᵁᴸᴸ │ └────────────────────────┘ 1 row in set. Elapsed: 0.002 sec. :) select arrayExcept([1], null); SELECT arrayExcept([1], NULL) Query id: 4770abf8-0de1-47b7-889c-0bcb400a232f ┌─arrayExcept([1], NULL)─┐ 1. │ ᴺᵁᴸᴸ │ └────────────────────────┘ 1 row in set. Elapsed: 0.001 sec. :) select arrayUnion(null, [1]); SELECT arrayUnion(NULL, [1]) Query id: e8c75415-2e0c-4a65-a7be-69dbd81f4546 ┌─arrayUnion(NULL, [1])─┐ 1. │ ᴺᵁᴸᴸ │ └───────────────────────┘ 1 row in set. Elapsed: 0.001 sec. :) select arrayUnion([1], null); SELECT arrayUnion([1], NULL) Query id: 21aabafd-8995-4a42-b7b9-4eb7e66746fd ┌─arrayUnion([1], NULL)─┐ 1. │ ᴺᵁᴸᴸ │ └───────────────────────┘ 1 row in set. Elapsed: 0.001 sec. :) select arrayIntersect(null, [1]); SELECT arrayIntersect(NULL, [1]) Query id: 3c161ed4-a244-4793-957f-925cdf6e5e0e ┌─arrayIntersect(NULL, [1])─┐ 1. │ ᴺᵁᴸᴸ │ └───────────────────────────┘ 1 row in set. Elapsed: 0.002 sec. :) select arrayIntersect([1], null); SELECT arrayIntersect([1], NULL) Query id: 9e486d9a-6129-40ea-947f-e8f63e7957e4 ┌─arrayIntersect([1], NULL)─┐ 1. │ ᴺᵁᴸᴸ │ └───────────────────────────┘ 1 row in set. Elapsed: 0.002 sec. ``` ## DuckDB behaviour Only supports array_intersect, and nullability depends on first argument: ```sql D select array_intersect(null, [1]); ┌───────────────────────────────────────────┐ │ array_intersect(NULL, main.list_value(1)) │ │ int32 │ ├───────────────────────────────────────────┤ │ NULL │ └───────────────────────────────────────────┘ D select array_intersect([1], null); ┌───────────────────────────────────────────┐ │ array_intersect(main.list_value(1), NULL) │ │ int32[] │ ├───────────────────────────────────────────┤ │ [] │ └───────────────────────────────────────────┘ ``` I believe with Spark & Clickhouse always returning null, and considering for most other functions we would return null, I think we should change the array union/intersect/except functions to return null if either input is null, whether that input is via a scalar null (with datatype null) or array input where the row being processed has a null value. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
