Re: [Discuss] Are Union.typeIds worth keeping?

Wes McKinney Wed, 10 Jul 2019 11:51:36 -0700

hi Ben,

Some applications use static type ids for various data types. Let's
consider one possibility:


BOOLEAN: 0
INT32: 1
DOUBLE: 2
STRING (UTF8): 3

If you were parsing JSON and constructing unions while parsing, you
might encounter some types, but not all. So if we _don't_ have the
option of having type ids in the metadata then we are left with some
unsatisfactory options:

A: Include all types in the resulting union, even if they are unobserved, or
B: Assign type ids dynamically to types as they are observed

Option B is potentially bad because it does not parallelize across
threads or nodes: workers that observe types in a different order would
assign different ids to the same type.
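
To make the parallelization concern concrete, here is a minimal sketch in
plain Python (it does not use any Arrow API). The static table mirrors the
ids listed above; the function names and the two sample chunks are
hypothetical, chosen only for illustration.

```python
# Static ids from the example table above (BOOLEAN, INT32, DOUBLE, STRING).
STATIC_TYPE_IDS = {bool: 0, int: 1, float: 2, str: 3}


def observe_static(values):
    """Tag each value with its static id; report only the ids actually seen."""
    ids = [STATIC_TYPE_IDS[type(v)] for v in values]
    return sorted(set(ids)), ids


def observe_dynamic(values):
    """Option B: hand out ids in order of first observation."""
    assigned = {}
    ids = []
    for v in values:
        # setdefault keeps an existing id, otherwise assigns the next free one.
        ids.append(assigned.setdefault(type(v), len(assigned)))
    return assigned, ids


# Two chunks of parsed JSON values, as if handled by two independent workers.
chunk_a = [1, "x", 2.5]
chunk_b = ["y", True]

static_a = observe_static(chunk_a)
static_b = observe_static(chunk_b)
dyn_a = observe_dynamic(chunk_a)
dyn_b = observe_dynamic(chunk_b)

# With static ids, strings are tagged 3 in both chunks.
# With dynamic ids, strings are tagged 1 in chunk_a but 0 in chunk_b.
print(static_a, static_b)
print(dyn_a, dyn_b)
```

With static ids the two chunks agree on what each id means without any
coordination, and the metadata only needs to list the ids that were
observed. With dynamic assignment the chunks would have to be reconciled
afterwards.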

So I do think the feature is useful. It does make the implementations
of unions more complex, though, so it does not come without cost. But
unions are already the most complex tool we have in our nested data
toolbox, so why not allow them to be flexible in this regard?

In any case I'm -0 on making changes, but would be interested in
feedback from others if there is strong sentiment about deprecating the
feature.

- Wes

On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <ben.kietz...@rstudio.com> wrote:
>
> The Union.typeIds property is confusing and its utility is unclear. I'd
> like to remove it (or at least document it better). Unless anyone knows a
> real advantage for keeping it I plan to assemble a PR to drop it from the
> format and the C++ implementation.
>
> ARROW-257 (resolved by pull request
> https://github.com/apache/arrow/pull/143) extended Unions with an optional
> typeIds property (in the C++ implementation, this is
> UnionType::type_codes). Prior to that pull request, each element (int8) in
> the type_ids (second) buffer of a union array was the index of a child
> array. Thus a type_ids buffer beginning with 5 indicated that the union
> array began with a value from child_data[5]. After that change, to
> interpret a type_id of 5 one must look through the typeIds property; the
> index at which a 5 is found is the index of the corresponding child array.
>
> The change was made to allow unused child arrays to be dropped; for example,
> if a union type were predefined with 64 members then an array of nearly
> identical type containing only int32 and utf8 values would only be required
> to have two child arrays. Note: the union types are not exactly identical
> even though they contain identical members; their typeIds properties will
> differ.
>
> However, unused child arrays can be replaced by null arrays (which are
> almost equally lightweight as they require no heap allocation). I'm also
> unaware of a use case for predefined type_ids; if they are
> application-specific then I think it's out of scope for Arrow to maintain a
> child_index <-> type_id mapping. It seems that the optimization has
> questionable merit and does not warrant the added complexity.
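
The two ways of interpreting a type_id described in the quoted message can
be sketched as follows. This is plain Python, not the Arrow C++ or Python
API; the function names are hypothetical, and the sample typeIds value
assumes a union that kept only the INT32 (1) and STRING (3) children from
the table earlier in this message.

```python
def child_index_direct(type_id):
    """Interpretation before ARROW-257: the type_id is the child index."""
    return type_id


def child_index_via_type_ids(type_id, type_ids):
    """Interpretation after ARROW-257: the child index is the position of
    type_id within the typeIds property."""
    return type_ids.index(type_id)


# Hypothetical union with two children: child 0 holds int32, child 1 holds utf8.
type_ids = [1, 3]

# Hypothetical contents of the type_ids buffer for a four-element union array.
types_buffer = [3, 1, 1, 3]

children = [child_index_via_type_ids(t, type_ids) for t in types_buffer]
print(children)
```

Here the buffer values 3 and 1 resolve to child indices 1 and 0, so the
array needs only two children even though the ids come from a larger
predefined table.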
