Hi Ben,

Some applications use static type ids for various data types. Let's consider one possibility:

  BOOLEAN: 0
  INT32: 1
  DOUBLE: 2
  STRING (UTF8): 3

If you were parsing JSON and constructing unions while parsing, you might
encounter some types, but not all. So if we _don't_ have the option of
having type ids in the metadata then we are left with some unsatisfactory
options:

A: Include all types in the resulting union, even if they are unobserved, or
B: Assign type ids dynamically to types when they are observed

Option B is potentially bad because it does not parallelize across threads
or nodes: each worker would assign ids in the order it happens to observe
the types, so the resulting unions would not agree with each other.

So I do think the feature is useful. It does make the implementations of
unions more complex, though, so it does not come without cost. But unions
are already the most complex tool we have in our nested data toolbox, so
why not allow them to be flexible in this regard?

In any case I'm -0 on making changes, but would be interested in feedback
from others if there is strong sentiment about deprecating the feature.

- Wes

On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <ben.kietz...@rstudio.com> wrote:
>
> The Union.typeIds property is confusing and its utility is unclear. I'd
> like to remove it (or at least document it better). Unless anyone knows a
> real advantage for keeping it I plan to assemble a PR to drop it from the
> format and the C++ implementation.
>
> ARROW-257 ( resolved by pull request
> https://github.com/apache/arrow/pull/143 ) extended Unions with an optional
> typeIds property (in the C++ implementation, this is
> UnionType::type_codes). Prior to that pull request each element (int8) in
> the type_ids (second) buffer of a union array was the index of a child
> array. Thus a type_ids buffer beginning with 5 indicated that the union
> array began with a value from child_data[5]. After that change to interpret
> a type_id of 5 one must look through the typeIds property and the index at
> which a 5 is found is the index of the corresponding child array.
>
> The change was made to allow unused child arrays to be dropped; for example
> if a union type were predefined with 64 members then an array of nearly
> identical type containing only int32 and utf8 values would only be required
> to have two child arrays. Note: the union types are not exactly identical
> even though they contain identical members; their typeIds properties will
> differ.
>
> However unused child arrays can be replaced by null arrays (which are
> almost equally lightweight as they require no heap allocation). I'm also
> unaware of a use case for predefined type_ids; if they are application
> specific then I think it's out of scope for arrow to maintain a child_index
> <-> type_id mapping. It seems that the optimization has questionable merit
> and does not warrant the added complexity.
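
P.S. To make the typeIds lookup described in the quoted message concrete, together with the static-id JSON case from the top of this mail, here is a rough sketch in plain Python. It is a toy model and not the Arrow API: STATIC_TYPE_IDS, type_name, build_union and get are names made up for illustration, and the "buffers" are ordinary lists. It assumes a dense-union-like layout with one type id and one offset per slot.

```python
# Hypothetical static ids, copied from the example at the top of this mail.
STATIC_TYPE_IDS = {"bool": 0, "int32": 1, "double": 2, "utf8": 3}


def type_name(value):
    """Map a parsed JSON scalar to one of the four example type names."""
    if isinstance(value, bool):  # must come first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int32"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "utf8"
    raise TypeError(f"unsupported value: {value!r}")


def build_union(values):
    """Build a toy union that only has children for the observed types.

    Returns (type_ids_property, type_id_buffer, offsets, children), where
    type_ids_property[i] is the static type id stored in children[i].
    """
    observed = sorted({STATIC_TYPE_IDS[type_name(v)] for v in values})
    child_index = {tid: i for i, tid in enumerate(observed)}
    children = [[] for _ in observed]
    type_id_buffer, offsets = [], []
    for v in values:
        tid = STATIC_TYPE_IDS[type_name(v)]
        child = children[child_index[tid]]
        type_id_buffer.append(tid)   # the buffer stores the static id
        offsets.append(len(child))   # position of the value in its child
        child.append(v)
    return observed, type_id_buffer, offsets, children


def get(union, i):
    """Resolve slot i: look the type id up in the typeIds property,
    and use the position where it is found as the child index."""
    type_ids_property, type_id_buffer, offsets, children = union
    child = type_ids_property.index(type_id_buffer[i])
    return children[child][offsets[i]]


# Two chunks parsed independently, e.g. on different threads or nodes.
chunk_a = build_union([1, "x", 2])        # observes only int32 and utf8
chunk_b = build_union([True, 2.5, "y"])   # observes bool, double and utf8
```

In this sketch chunk_a carries two children and chunk_b carries three, yet utf8 is type id 3 in both, so the two workers never have to coordinate. Without the typeIds property the buffer values would have to be child indices, which leaves only options A and B above.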