Convention 2 seems more correct to me; if a UnionArray cannot
contain top level nulls then a UnionScalar should not be nullable.
Furthermore, I think that it's reasonable for
`MakeNullScalar(some_union_type)->is_valid` to be true, though
the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
should include explicit warnings of the special case which unions
represent.

It's worth noting that convention 1 doesn't round trip through the
scalar. Consider the type

    t = dense_union({field("i", int8()), field("b", boolean())},
/*type_codes=*/{4, 8});

If we define an array of this type with a single element like so:

    a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");

then the scalar returned by `a.GetScalar(0)` is one of:

    conv_1 = { .is_valid = false, .value = nullptr }
    conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }

... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
produce a correctly round tripped array with type_codes of 8. On the other
hand, broadcasting `conv_1` produces

    ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");

(In general replacing the type code with whichever type code was
declared first.)



On Thu, Jul 29, 2021 at 7:00 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> The Scalar base class has a `bool is_valid` member that is used to
> represent null scalars for all types (including the null type).
>
> A UnionScalar, since it inherits from Scalar, has the following de facto
> structure:
>
> {
>    // whether the scalar is null
>    bool is_valid;
>    // the underlying union value (only set if `is_valid` is true)
>    std::shared_ptr<Scalar> value;
> }
>
> However, union arrays don't have a top-level validity bitmap.  A null in
> an union array is always encoded as a "valid" top-level entry with a
> null entry in the chosen child array.
>
> Therefore, there are two possible conventions for representing null
> union scalars:
>
> 1) set `scalar.is_valid` to false and `scalar.value` to nullptr
>
> 2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for
> one the union's child types
>
> Advantages / drawbacks of convention 1
> --------------------------------------
>
> + consistency: `MakeNullScalar(union type)` always returns a scalar with
> `is_valid` set to false
>
> - `union_array.GetScalar(i)` can return a null scalar even when
> `union_array.IsNull(i)` would return false
>
> Advantages / drawbacks of convention 2
> --------------------------------------
>
> + more congruent with the union datatype definition
>
> + `union_array.GetScalar(i)` always returns a valid scalar, just like
> `union_array.IsNull(i)` always returns false
>
> - `MakeNullScalar(union type)` returns a valid scalar (or should it
> return an error??)
>
> - breaks compatibility, since the current behaviour follows convention 1
> (but UnionScalar is probably not widely used...)
>
>
> This came in the context of https://github.com/apache/arrow/pull/10817,
> but is unrelated to the changes in this PR.
>
> Thoughts? Opinions?
>
> Regards
>
> Antoine.
>

Reply via email to