Hi Paul,

TL;DR: I think the typeIds field you referenced is not the offset for dense vectors mentioned by the spec. I believe (but lack the historical context) that it is an outgrowth of the Java implementation that might be useful in other contexts.

The requirement on the typeIds field you referenced is that its length is less than 127; the bit-width of each ID is immaterial. Also, the typeIds field and unions aren't fully supported yet. There is an open PR [1] which got stalled on performance and long-term direction concerns.

I haven't fully validated this, but my rough understanding is that the Java implementation assumes only one array/vector of each type is in a union. Roughly, each logical type + Schema.fbs enum parameterization has its own type with its own type ID (I think the number is still less than 127 but might grow larger). The implementation makes use of this fact to do some optimizations. So when a union (I think only Sparse is supported in Java) serializes itself, it records each of the type IDs [2] so it can easily map back to them.

[1] https://github.com/apache/arrow/pull/987
[2] https://github.com/apache/arrow/blob/73d379f4631cd3013371f60876a52615171e6c3b/java/vector/src/main/codegen/templates/UnionVector.java#L329

On Wed, Mar 20, 2019 at 1:08 AM Paul Taylor <[email protected]> wrote:

> I noticed the the DenseUnion docs[1] says the typeIds buffer is 8-bit
> signed integers, but in the flatbuffer schema[2] it's typed as int (and
> flatc generates a function that returns an Int32Array).
>
> How are the other implementations treating this buffer, and should we
> update the docs or the flatbuffers schema?
>
> Thanks,
>
> Paul
>
> 1. https://arrow.apache.org/docs/format/Layout.html#dense-union-type
>
> 2. https://github.com/apache/arrow/blob/50bc9f49671afb56594910f49b9bf34e080a70e7/format/Schema.fbs#L92
