hi Paul,

On Tue, Feb 26, 2019 at 1:16 PM Paul Taylor <ptay...@apache.org> wrote:
>
> An alternative that's worked for us is (ab)using single-child
> SparseUnions to represent custom types. We have an enum of "well-known"
> typeIds (UUID, vec2's, IP addresses, etc), whose data is stored in one
> of the known Arrow types, as you've done.
>
> Pros are the typeIds buffer is tiny, and doesn't require metadata
> propagation or string matching to maintain type information.
>
> Cons are this is really an abuse of the Union type, and since the typeId
> buffer is (implicitly?) an Int8, we can only have 255 extension types
> today. We don't have that many yet, but that could be an issue if this
> pattern were generalized to any number of custom types.
>
> I'm not sure how widely supported Unions are across the Arrow
> implementations or ecosystem (unsure about pandas, Rapids/cuDF no
> support yet), but maybe this pattern could work more generally if we
> defined an enum of "well-known" extension typeIds?

This seems a bit kludge-y to me in a general-purpose setting (in a
domain-specific setting it may be entirely appropriate; that's not for
me to decide).

I have heard several times over the last few years from business users
of Arrow who have proprietary / domain-specific data types. So even
if we had a common collection of "open source" extension types (e.g.
things like "uuid"), there is still a need to define metadata to
annotate built-in types to give different meaning (e.g. "special"
timestamps), and provide a way in Java, C++, Python, JS, etc. for the
right kind of user-defined container type to be constructed when IPC
messages are reconstructed.
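
To make that concrete, here is a rough sketch in plain Python (no Arrow
library involved). Only the two metadata key names come from the
proposal quoted below; Field, UuidType, register_extension, annotate,
and resolve_type are hypothetical names made up for illustration and
are not part of any Arrow API.

```python
from dataclasses import dataclass, field

# Key names from the proposal quoted below; everything else is hypothetical.
EXT_NAME_KEY = "arrow_extension_name"
EXT_DATA_KEY = "arrow_extension_data"


@dataclass(frozen=True)
class Field:
    """Toy stand-in for an Arrow Field: name, storage type, metadata."""
    name: str
    storage_type: str
    custom_metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class UuidType:
    """Hypothetical user-defined type wrapping 16-byte binary storage."""
    storage_type: str
    serialized: str


# Maps extension name -> callable(storage_type, serialized) -> type object.
_registry = {}


def register_extension(name, factory):
    """Make a user-defined type known to this (toy) consumer."""
    _registry[name] = factory


def annotate(fld, ext_name, serialized=""):
    """Return a copy of fld tagged as an extension type via its metadata."""
    meta = dict(fld.custom_metadata)
    meta[EXT_NAME_KEY] = ext_name
    meta[EXT_DATA_KEY] = serialized
    return Field(fld.name, fld.storage_type, meta)


def resolve_type(fld):
    """Rebuild the user-defined type if known, else fall back to storage."""
    factory = _registry.get(fld.custom_metadata.get(EXT_NAME_KEY))
    if factory is None:
        # Unaware consumer: sees raw storage; metadata stays on the field.
        return fld.storage_type
    return factory(fld.storage_type, fld.custom_metadata.get(EXT_DATA_KEY, ""))


sent = annotate(Field("id", "fixed_size_binary[16]"), "uuid")
unaware_view = resolve_type(sent)     # "uuid" is not registered yet
register_extension("uuid", UuidType)
aware_view = resolve_type(sent)       # now reconstructed as UuidType
```

A consumer that has never heard of "uuid" still gets the storage type
and leaves the metadata on the field, so it can pass it along; one that
has registered the name gets the user-defined type back.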

- Wes

>
> Thanks,
>
> Paul
>
>
> On 2/25/19 3:32 PM, Wes McKinney wrote:
> > hi folks,
> >
> > I recently wrote a patch to propose a C++ API for user-defined
> > "extension" types
> >
> > https://github.com/apache/arrow/pull/3694
> >
> > The idea is that an extension type wraps a pre-existing Arrow type.
> > For example a UUIDType can be represented as FixedSizeBinary(16). The
> > intent is that Arrow consumers which are not aware of an extension
> > type can ignore the additional type metadata and still interact with
> > the raw storage.
> >
> > One question is how to permit such metadata to be preserved through
> > IPC / RPC messages (i.e., Schema.fbs) and how other languages can
> > interact with it. There are a couple of options:
> >
> > * What I implemented in my patch: use the Field-level custom_metadata
> > field with known key names "arrow_extension_name" and
> > "arrow_extension_data" for the type's unique identifier and serialized
> > form, respectively. If we opt for this, then we should add a section
> > to the specification to codify the convention used
> >
> > * Add a new field to the Field table in Schema.fbs
> >
> > The former is attractive in the sense that consumers who don't have
> > special handling for an extension type will carry along the Field
> > metadata in their schema, so it can be passed on in subsequent IPC
> > messages without writing any extra code.
> >
> > Thoughts about this? With a C++ implementation landing, it would be
> > great to identify a champion to create a Java implementation and also
> > add integration test support to ensure that consumers do not destroy
> > the extension type metadata for unrecognized types (i.e. if I send you
> > data that says it's "uuid" and you don't know what that is yet, you
> > preserve the metadata fields anyway).
> >
> > Thanks
> > Wes
