Thanks all, this is helpful and I've added
https://issues.apache.org/jira/browse/ARROW-5933 to improve the
documentation for future developers.

On Wed, Jul 10, 2019 at 11:09 PM Jacques Nadeau <jacq...@apache.org> wrote:

> I was also supportive of this pattern. We definitely have used it before to
> optimize in certain cases.
>
> On Wed, Jul 10, 2019, 2:40 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman <ben.kietz...@rstudio.com>
> > wrote:
> > >
> > > In this scenario option A (include child arrays for each child type,
> even
> > > if that type is not observed) seems like the clearly correct choice to
> > me.
> > > It yiedls a more intuitive layout for the union array and incurs no
> > runtime
> > > overhead (since the absent children are empty/null arrays).
> >
> > I am not sure this is right. The child arrays still occupy memory in
> > the Sparse Union case (where all child arrays have the same length).
> > In order to satisfy the requirements of the IPC protocol, the child
> > arrays need to be of the same type as the types in the union. In the
> > Dense Union case, the not-present children will have length 0.
> >
> > >
> > > > why not allow them to be flexible in this regard?
> > >
> > > I would say that if code doesn't add anything except cognitive overhead
> > > then it's worthwhile to remove it.
> >
> > The cognitive overhead comes for the Arrow library implementer --
> > users of the libraries aren't required to deal with this detail
> > necessarily. The type ids are optional, after all. Even if it is
> > removed, you still have ids, so whether it's
> >
> > type 0, id=0
> > type 1, id=1
> > type 2, id=2
> >
> > or
> >
> > type 0, id=3
> > type 1, id=7
> > type 2, id=10
> >
> > the difference is in the second case, you have to look up the code
> > corresponding to each type rather than assuming that the type's
> > position and its code are the same.
> >
> > In processing, branching should occur at the Type level, so a function
> > to process a child looks like
> >
> > ProcessChild(child, child_id, ...)
> >
> > In either case you have to match a child with its id that appears in the
> > data.
> >
> > Anyway, since Julien and I are responsible for introducing this
> > concept in the early stages of the project I'm interested to hear more
> > from others. Note that this doesn't serve to resolve the
> > Union-of-Nested-Types problem that has prevented the development of
> > integration tests between Java and C++.
> >
> > >
> > > On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >
> > > > hi Ben,
> > > >
> > > > Some applications use static type ids for various data types. Let's
> > > > consider one possibility:
> > > >
> > > > BOOLEAN: 0
> > > > INT32: 1
> > > > DOUBLE: 2
> > > > STRING (UTF8): 3
> > > >
> > > > If you were parsing JSON and constructing unions while parsing, you
> > > > might encounter some types, but not all. So if we _don't_ have the
> > > > option of having type ids in the metadata then we are left with some
> > > > unsatisfactory options:
> > > >
> > > > A: Include all types in the resulting union, even if they are
> > unobserved,
> > > > or
> > > > B: Assign type id dynamically to types when they are observed
> > > >
> > > > Option B: is potentially bad because it does not parallelize across
> > > > threads or nodes.
> > > >
> > > > So I do think the feature is useful. It does make the implementations
> > > > of unions more complex, though, so it does not come without cost. But
> > > > unions are already the most complex tool we have in our nested data
> > > > toolbox, so why not allow them to be flexible in this regard?
> > > >
> > > > In any case I'm -0 on making changes, but would be interested in
> > > > feedback of others if there is strong sentiment about deprecating the
> > > > feature.
> > > >
> > > > - Wes
> > > >
> > > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <
> ben.kietz...@rstudio.com
> > >
> > > > wrote:
> > > > >
> > > > > The Union.typeIds property is confusing and its utility is unclear.
> > I'd
> > > > > like to remove it (or at least document it better). Unless anyone
> > knows a
> > > > > real advantage for keeping it I plan to assemble a PR to drop it
> > from the
> > > > > format and the C++ implementation.
> > > > >
> > > > > ARROW-257 ( resolved by pull request
> > > > > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> > > > optional
> > > > > typeIds property (in the C++ implementation, this is
> > > > > UnionType::type_codes). Prior to that pull request each element
> > (int8) in
> > > > > the type_ids (second) buffer of a union array was the index of a
> > child
> > > > > array. Thus a type_ids buffer beginning with 5 indicated that the
> > union
> > > > > array began with a value from child_data[5]. After that change to
> > > > interpret
> > > > > a type_id of 5 one must look through the typeIds property and the
> > index
> > > > at
> > > > > which a 5 is found is the index of the corresponding child array.
> > > > >
> > > > > The change was made to allow unused child arrays to be dropped; for
> > > > example
> > > > > if a union type were predefined with 64 members then an array of
> > nearly
> > > > > identical type containing only int32 and utf8 values would only be
> > > > required
> > > > > to have two child arrays. Note: the union types are not exactly
> > identical
> > > > > even though they contain identical members; their typeIds
> properties
> > will
> > > > > differ.
> > > > >
> > > > > However unused child arrays can be replaced by null arrays (which
> are
> > > > > almost equally lightweight as they require no heap allocation). I'm
> > also
> > > > > unaware of a use case for predefined type_ids; if they are
> > application
> > > > > specific then I think it's out of scope for arrow to maintain a
> > > > child_index
> > > > > <-> type_id mapping. It seems that the optimization has
> questionable
> > > > merit
> > > > > and does not warrant the added complexity.
> > > >
> >
>

Reply via email to