Convention 2 seems more correct to me; if a UnionArray cannot contain top level nulls then a UnionScalar should not be nullable. Furthermore, I think that it's reasonable for `MakeNullScalar(some_union_type)->is_valid` to be true, though the doccomment for both `MakeNullScalar` and `MakeArrayOfNull` should include explicit warnings of the special case which unions represent.
It's worth noting that convention 1 doesn't round trip through the scalar. Consider the type t = dense_union({field("i", int8()), field("b", boolean())}, /*type_codes=*/{4, 8}); If we define an array of this type with a single element like so: a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])"); then the scalar returned by `a.GetScalar(0)` is one of: conv_1 = { .is_valid = false, .value = nullptr } conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) } ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we produce a correctly round tripped array with type_codes of 8. On the other hand, broadcasting `conv_1` produces ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])"); (In general replacing the type code with whichever type code was declared first.) On Thu, Jul 29, 2021 at 7:00 AM Antoine Pitrou <anto...@python.org> wrote: > > Hello, > > The Scalar base class has a `bool is_valid` member that is used to > represent null scalars for all types (including the null type). > > A UnionScalar, since it inherits from Scalar, has the following de facto > structure: > > { > // whether the scalar is null > bool is_valid; > // the underlying union value (only set if `is_valid` is true) > std::shared_ptr<Scalar> value; > } > > However, union arrays don't have a top-level validity bitmap. A null in > an union array is always encoded as a "valid" top-level entry with a > null entry in the chosen child array. > > Therefore, there are two possible conventions for representing null > union scalars: > > 1) set `scalar.is_valid` to false and `scalar.value` to nullptr > > 2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for > one the union's child types > > Advantages / drawbacks of convention 1 > -------------------------------------- > > + consistency: `MakeNullScalar(union type)` always returns a scalar with > `is_valid` set to false > > - `union_array.GetScalar(i)` can return a null scalar even when > `union_array.IsNull(i)` would return false > > Advantages / drawbacks of convention 2 > -------------------------------------- > > + more congruent with the union datatype definition > > + `union_array.GetScalar(i)` always returns a valid scalar, just like > `union_array.IsNull(i)` always returns false > > - `MakeNullScalar(union type)` returns a valid scalar (or should it > return an error??) > > - breaks compatibility, since the current behaviour follows convention 1 > (but UnionScalar is probably not widely used...) > > > This came in the context of https://github.com/apache/arrow/pull/10817, > but is unrelated to the changes in this PR. > > Thoughts? Opinions? > > Regards > > Antoine. >