hi Paul,

On Tue, Feb 26, 2019 at 1:16 PM Paul Taylor <ptay...@apache.org> wrote:
>
> An alternative that's worked for us is (ab)using single-child
> SparseUnions to represent custom types. We have an enum of "well-known"
> typeIds (UUID, vec2's, IP addresses, etc), whose data is stored in one
> of the known Arrow types, as you've done.
>
> Pros are the typeIds buffer is tiny, and doesn't require metadata
> propagation or string matching to maintain type information.
>
> Cons are this is really an abuse of the Union type, and since the typeId
> buffer is (implicitly?) an Int8, we can only have 255 extension types
> today. We don't have that many yet, but that could be an issue if this
> pattern were generalized to any number of custom types.
>
> I'm not sure how widely supported Unions are across the Arrow
> implementations or ecosystem (unsure about pandas, Rapids/cuDF no
> support yet), but maybe this pattern could work more generally if we
> defined an enum of "well-known" extension typeIds?

This seems a bit kludge-y to me in a general-purpose setting (in a
domain-specific setting it may be entirely appropriate; that's not for
me to decide). I have heard several times over the last few years from
business users of Arrow who have proprietary / domain-specific data
types. So even if we had a common collection of "open source"
extension types (e.g. things like "uuid"), there is still a need to
define metadata that annotates built-in types to give them a different
meaning (e.g. "special" timestamps), and to provide a way in Java, C++,
Python, JS, etc. for the right kind of user-defined container type to
be constructed when IPC messages are reconstructed.

- Wes

>
> Thanks,
>
> Paul
>
>
> On 2/25/19 3:32 PM, Wes McKinney wrote:
> > hi folks,
> >
> > I recently wrote a patch to propose a C++ API for user-defined "extension"
> > types
> >
> > https://github.com/apache/arrow/pull/3694
> >
> > The idea is that an extension type wraps a pre-existing Arrow type.
> > For example a UUIDType can be represented as FixedSizeBinary(16). The
> > intent is that Arrow consumers which are not aware of an extension
> > type can ignore the additional type metadata and still interact with
> > the raw storage
> >
> > One question is how to permit such metadata to be preserved through
> > IPC / RPC messages (i.e., Schema.fbs) and how other languages can
> > interact with it. There are a couple of options:
> >
> > * What I implemented in my patch: use the Field-level custom_metadata
> > field with known key names "arrow_extension_name" and
> > "arrow_extension_data" for the type's unique identifier and serialized
> > form, respectively. If we opt for this, then we should add a section
> > to the specification to codify the convention used
> >
> > * Add a new field to the Field table in Schema.fbs
> >
> > The former is attractive in the sense that consumers who don't have
> > special handling for an extension type will carry along the Field
> > metadata in their schema, so it can be passed on in subsequent IPC
> > messages without writing any extra code.
> >
> > Thoughts about this? With a C++ implementation landing, it would be
> > great to identify a champion to create a Java implementation and also
> > add integration test support to ensure that consumers do not destroy
> > the extension type metadata for unrecognized types (i.e. if I send you
> > data that says it's "uuid" and you don't know what that is yet, you
> > preserve the metadata fields anyway).
> >
> > Thanks
> > Wes
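
To make the custom_metadata option concrete, here is a rough sketch of the
convention in plain Python. It does not use the Arrow API: the Field class,
REGISTRY, annotate(), resolve() and forward() are all hypothetical names made
up for illustration, and storage types are just strings. Only the two key
names come from the proposal quoted above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Key names taken from the proposal quoted above.
EXT_NAME_KEY = "arrow_extension_name"
EXT_DATA_KEY = "arrow_extension_data"


@dataclass
class Field:
    """Hypothetical stand-in for an Arrow Field (not the real API)."""
    name: str
    storage_type: str  # e.g. "fixed_size_binary[16]"
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical registry: extension name -> factory that rebuilds the
# user-defined type from the storage type and the serialized form.
REGISTRY: Dict[str, Callable[[str, str], object]] = {}


def annotate(f: Field, ext_name: str, serialized: str) -> Field:
    """Producer side: tag a storage field with the extension metadata."""
    md = dict(f.metadata)
    md[EXT_NAME_KEY] = ext_name
    md[EXT_DATA_KEY] = serialized
    return Field(f.name, f.storage_type, md)


def resolve(f: Field):
    """Consumer side: rebuild the extension type if it is registered.

    Returns None when the field carries no extension name or the name is
    unknown, in which case the consumer just works with the raw storage.
    """
    factory = REGISTRY.get(f.metadata.get(EXT_NAME_KEY))
    if factory is None:
        return None
    return factory(f.storage_type, f.metadata.get(EXT_DATA_KEY, ""))


def forward(f: Field) -> Field:
    """An unaware consumer copying the field: metadata rides along as-is."""
    return Field(f.name, f.storage_type, dict(f.metadata))
```

A consumer with nothing registered for "uuid" gets None from resolve() and
keeps using the FixedSizeBinary(16) storage, while forward() still passes both
keys downstream, which is the behaviour the integration tests would need to
check.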