samuelcolvin commented on issue #11162: URL: https://github.com/apache/datafusion/issues/11162#issuecomment-2212326030
See #11314 as a demonstration of the problem for both dense and sparse unions.

After a bit of investigation, the issue lies in the first instance with

https://github.com/apache/datafusion/blob/08c5345e932f1c5c948751e0d06b1fd99e174efa/datafusion/physical-expr/src/expressions/is_null.rs#L74-L84

Then with [this code](https://github.com/apache/arrow-rs/blob/b9562b9550b8ff4aa7be9859e56e467b1a3b3de6/arrow-arith/src/boolean.rs#L314-L332) in `arrow-rs`:

```rs
/// Returns a non-null [BooleanArray] with whether each value of the array is null.
/// # Error
/// This function never errors.
/// # Example
/// ...
pub fn is_null(input: &dyn Array) -> Result<BooleanArray, ArrowError> {
    let values = match input.logical_nulls() {
        None => BooleanBuffer::new_unset(input.len()),
        Some(nulls) => !nulls.inner(),
    };
    Ok(BooleanArray::new(values, None))
}
```

And then with [this code](https://github.com/apache/arrow-rs/blob/b9562b9550b8ff4aa7be9859e56e467b1a3b3de6/arrow-array/src/array/union_array.rs#L482-L486):

```rs
/// Union types always return non null as there is no validity buffer.
/// To check validity correctly you must check the underlying vector.
fn is_null(&self, _index: usize) -> bool {
    false
}
```

Ultimately with [the spec](https://github.com/apache/arrow/blob/674e70891d1b3bc82b025d9c434d8ff1aa4c877e/docs/source/format/Columnar.rst?plain=1#L862-L864):

> Unlike other data types, unions do not have their own validity bitmap. Instead,
> the nullness of each slot is determined exclusively by the child arrays which
> are composed to create the union.

---

Basically arrow is saying "we're not going to tell you if a union is null, you need to look in the child arrays", but datafusion isn't listening and is just asking the union if it's null in the naive way.

Two options to move forward as far as I can tell:
1. Decide unions in DF can never be null, in which case I'll need to abandon unions in `datafusion-functions-json` and just return strings everywhere
2. Have custom logic for unions that looks up the child array to determine if the value is null

If (as I hope) we go for the second option, there's also the issue (as demonstrated by #11314) that the representation of "null" union items doesn't match other types: it shows `{A=}` instead of an empty string.
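To make the second option concrete, here is a minimal sketch of the child lookup, written against a toy model, not the real `arrow-rs` types. Everything in it (`ToyUnion`, `null_mask`, `demo`) is hypothetical and exists only for illustration. It also assumes type ids are plain child indices and offsets are `usize`, which is simpler than the real format.

```rust
/// Hypothetical, simplified stand-in for a union array (NOT the arrow-rs API).
struct ToyUnion {
    /// One entry per slot: index of the child that holds the slot's value.
    type_ids: Vec<usize>,
    /// Dense unions carry a per-slot offset into the selected child.
    /// Sparse unions use the slot index itself, modeled here as `None`.
    offsets: Option<Vec<usize>>,
    /// Per-child validity: `true` means the child value is non-null.
    child_validity: Vec<Vec<bool>>,
}

impl ToyUnion {
    fn len(&self) -> usize {
        self.type_ids.len()
    }

    /// Null-ness of slot `i`, resolved through the child selected by the type id.
    fn is_null(&self, i: usize) -> bool {
        let child = self.type_ids[i];
        let pos = match &self.offsets {
            Some(offsets) => offsets[i], // dense: follow the offset
            None => i,                   // sparse: same position in every child
        };
        !self.child_validity[child][pos]
    }

    /// One bool per slot, `true` = null (the shape an IS NULL kernel would need).
    fn null_mask(&self) -> Vec<bool> {
        (0..self.len()).map(|i| self.is_null(i)).collect()
    }
}

/// Builds one sparse and one dense toy union and returns their null masks.
fn demo() -> (Vec<bool>, Vec<bool>) {
    let sparse = ToyUnion {
        type_ids: vec![0, 1, 0],
        offsets: None,
        child_validity: vec![vec![true, true, false], vec![true, false, true]],
    };
    let dense = ToyUnion {
        type_ids: vec![0, 1, 0],
        offsets: Some(vec![0, 0, 1]),
        child_validity: vec![vec![true, false], vec![true]],
    };
    (sparse.null_mask(), dense.null_mask())
}
```

In this model a slot is null exactly when the child entry it points at is null, which is the reading of the spec quoted above. A real implementation would have to map type ids to child fields and handle nested unions, both of which this sketch leaves out.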