Thanks all, this is helpful and I've added https://issues.apache.org/jira/browse/ARROW-5933 to improve the documentation for future developers.
On Wed, Jul 10, 2019 at 11:09 PM Jacques Nadeau <jacq...@apache.org> wrote: > I was also supportive of this pattern. We definitely have used it before to > optimize in certain cases. > > On Wed, Jul 10, 2019, 2:40 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman <ben.kietz...@rstudio.com> > > wrote: > > > > > > In this scenario option A (include child arrays for each child type, > even > > > if that type is not observed) seems like the clearly correct choice to > > me. > > > It yiedls a more intuitive layout for the union array and incurs no > > runtime > > > overhead (since the absent children are empty/null arrays). > > > > I am not sure this is right. The child arrays still occupy memory in > > the Sparse Union case (where all child arrays have the same length). > > In order to satisfy the requirements of the IPC protocol, the child > > arrays need to be of the same type as the types in the union. In the > > Dense Union case, the not-present children will have length 0. > > > > > > > > > why not allow them to be flexible in this regard? > > > > > > I would say that if code doesn't add anything except cognitive overhead > > > then it's worthwhile to remove it. > > > > The cognitive overhead comes for the Arrow library implementer -- > > users of the libraries aren't required to deal with this detail > > necessarily. The type ids are optional, after all. Even if it is > > removed, you still have ids, so whether it's > > > > type 0, id=0 > > type 1, id=1 > > type 2, id=2 > > > > or > > > > type 0, id=3 > > type 1, id=7 > > type 2, id=10 > > > > the difference is in the second case, you have to look up the code > > corresponding to each type rather than assuming that the type's > > position and its code are the same. > > > > In processing, branching should occur at the Type level, so a function > > to process a child looks like > > > > ProcessChild(child, child_id, ...) > > > > In either case you have to match a child with its id that appears in the > > data. > > > > Anyway, since Julien and I are responsible for introducing this > > concept in the early stages of the project I'm interested to hear more > > from others. Note that this doesn't serve to resolve the > > Union-of-Nested-Types problem that has prevented the development of > > integration tests between Java and C++. > > > > > > > > On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > > > > > hi Ben, > > > > > > > > Some applications use static type ids for various data types. Let's > > > > consider one possibility: > > > > > > > > BOOLEAN: 0 > > > > INT32: 1 > > > > DOUBLE: 2 > > > > STRING (UTF8): 3 > > > > > > > > If you were parsing JSON and constructing unions while parsing, you > > > > might encounter some types, but not all. So if we _don't_ have the > > > > option of having type ids in the metadata then we are left with some > > > > unsatisfactory options: > > > > > > > > A: Include all types in the resulting union, even if they are > > unobserved, > > > > or > > > > B: Assign type id dynamically to types when they are observed > > > > > > > > Option B: is potentially bad because it does not parallelize across > > > > threads or nodes. > > > > > > > > So I do think the feature is useful. It does make the implementations > > > > of unions more complex, though, so it does not come without cost. But > > > > unions are already the most complex tool we have in our nested data > > > > toolbox, so why not allow them to be flexible in this regard? > > > > > > > > In any case I'm -0 on making changes, but would be interested in > > > > feedback of others if there is strong sentiment about deprecating the > > > > feature. > > > > > > > > - Wes > > > > > > > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman < > ben.kietz...@rstudio.com > > > > > > > wrote: > > > > > > > > > > The Union.typeIds property is confusing and its utility is unclear. > > I'd > > > > > like to remove it (or at least document it better). Unless anyone > > knows a > > > > > real advantage for keeping it I plan to assemble a PR to drop it > > from the > > > > > format and the C++ implementation. > > > > > > > > > > ARROW-257 ( resolved by pull request > > > > > https://github.com/apache/arrow/pull/143 ) extended Unions with an > > > > optional > > > > > typeIds property (in the C++ implementation, this is > > > > > UnionType::type_codes). Prior to that pull request each element > > (int8) in > > > > > the type_ids (second) buffer of a union array was the index of a > > child > > > > > array. Thus a type_ids buffer beginning with 5 indicated that the > > union > > > > > array began with a value from child_data[5]. After that change to > > > > interpret > > > > > a type_id of 5 one must look through the typeIds property and the > > index > > > > at > > > > > which a 5 is found is the index of the corresponding child array. > > > > > > > > > > The change was made to allow unused child arrays to be dropped; for > > > > example > > > > > if a union type were predefined with 64 members then an array of > > nearly > > > > > identical type containing only int32 and utf8 values would only be > > > > required > > > > > to have two child arrays. Note: the union types are not exactly > > identical > > > > > even though they contain identical members; their typeIds > properties > > will > > > > > differ. > > > > > > > > > > However unused child arrays can be replaced by null arrays (which > are > > > > > almost equally lightweight as they require no heap allocation). I'm > > also > > > > > unaware of a use case for predefined type_ids; if they are > > application > > > > > specific then I think it's out of scope for arrow to maintain a > > > > child_index > > > > > <-> type_id mapping. It seems that the optimization has > questionable > > > > merit > > > > > and does not warrant the added complexity. > > > > > > >