On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman <ben.kietz...@rstudio.com> wrote:
>
> In this scenario option A (include child arrays for each child type, even
> if that type is not observed) seems like the clearly correct choice to me.
> It yiedls a more intuitive layout for the union array and incurs no runtime
> overhead (since the absent children are empty/null arrays).

I am not sure this is right. The child arrays still occupy memory in
the Sparse Union case (where all child arrays have the same length).
In order to satisfy the requirements of the IPC protocol, the child
arrays need to be of the same type as the types in the union. In the
Dense Union case, the not-present children will have length 0.

>
> > why not allow them to be flexible in this regard?
>
> I would say that if code doesn't add anything except cognitive overhead
> then it's worthwhile to remove it.

The cognitive overhead comes for the Arrow library implementer --
users of the libraries aren't required to deal with this detail
necessarily. The type ids are optional, after all. Even if it is
removed, you still have ids, so whether it's

type 0, id=0
type 1, id=1
type 2, id=2

or

type 0, id=3
type 1, id=7
type 2, id=10

the difference is in the second case, you have to look up the code
corresponding to each type rather than assuming that the type's
position and its code are the same.

In processing, branching should occur at the Type level, so a function
to process a child looks like

ProcessChild(child, child_id, ...)

In either case you have to match a child with its id that appears in the data.

Anyway, since Julien and I are responsible for introducing this
concept in the early stages of the project I'm interested to hear more
from others. Note that this doesn't serve to resolve the
Union-of-Nested-Types problem that has prevented the development of
integration tests between Java and C++.

>
> On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Ben,
> >
> > Some applications use static type ids for various data types. Let's
> > consider one possibility:
> >
> > BOOLEAN: 0
> > INT32: 1
> > DOUBLE: 2
> > STRING (UTF8): 3
> >
> > If you were parsing JSON and constructing unions while parsing, you
> > might encounter some types, but not all. So if we _don't_ have the
> > option of having type ids in the metadata then we are left with some
> > unsatisfactory options:
> >
> > A: Include all types in the resulting union, even if they are unobserved,
> > or
> > B: Assign type id dynamically to types when they are observed
> >
> > Option B: is potentially bad because it does not parallelize across
> > threads or nodes.
> >
> > So I do think the feature is useful. It does make the implementations
> > of unions more complex, though, so it does not come without cost. But
> > unions are already the most complex tool we have in our nested data
> > toolbox, so why not allow them to be flexible in this regard?
> >
> > In any case I'm -0 on making changes, but would be interested in
> > feedback of others if there is strong sentiment about deprecating the
> > feature.
> >
> > - Wes
> >
> > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <ben.kietz...@rstudio.com>
> > wrote:
> > >
> > > The Union.typeIds property is confusing and its utility is unclear. I'd
> > > like to remove it (or at least document it better). Unless anyone knows a
> > > real advantage for keeping it I plan to assemble a PR to drop it from the
> > > format and the C++ implementation.
> > >
> > > ARROW-257 ( resolved by pull request
> > > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> > optional
> > > typeIds property (in the C++ implementation, this is
> > > UnionType::type_codes). Prior to that pull request each element (int8) in
> > > the type_ids (second) buffer of a union array was the index of a child
> > > array. Thus a type_ids buffer beginning with 5 indicated that the union
> > > array began with a value from child_data[5]. After that change to
> > interpret
> > > a type_id of 5 one must look through the typeIds property and the index
> > at
> > > which a 5 is found is the index of the corresponding child array.
> > >
> > > The change was made to allow unused child arrays to be dropped; for
> > example
> > > if a union type were predefined with 64 members then an array of nearly
> > > identical type containing only int32 and utf8 values would only be
> > required
> > > to have two child arrays. Note: the union types are not exactly identical
> > > even though they contain identical members; their typeIds properties will
> > > differ.
> > >
> > > However unused child arrays can be replaced by null arrays (which are
> > > almost equally lightweight as they require no heap allocation). I'm also
> > > unaware of a use case for predefined type_ids; if they are application
> > > specific then I think it's out of scope for arrow to maintain a
> > child_index
> > > <-> type_id mapping. It seems that the optimization has questionable
> > merit
> > > and does not warrant the added complexity.
> >

Reply via email to