I think this should be an extension type, yes.

It could be parametrized on the storage type, so that the other system at
least knows that one type is based on another (e.g. a user-defined type). The
original type's metadata can be preserved in the extension type's metadata.
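
To make that concrete, here is a rough pure-Python sketch of the idea. It
does not use pyarrow; Field, external_field, decode and the extension name
"example.external" are all hypothetical names made up for illustration. The
only things taken from the format itself are the two field metadata keys
that the columnar spec reserves for extension types.

```python
import uuid
from dataclasses import dataclass, field

# Field metadata keys reserved by the Arrow columnar format for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical extension name, chosen only for this sketch.
EXT_NAME = "example.external"


@dataclass
class Field:
    """Toy stand-in for an Arrow field (not a pyarrow class)."""
    name: str
    storage_type: str  # e.g. "fixed_size_binary[16]"
    metadata: dict = field(default_factory=dict)


def external_field(name, storage_type, external_type):
    """Tag a storage-typed field with the external system's type name."""
    return Field(name, storage_type, {
        EXT_NAME_KEY: EXT_NAME,
        EXT_META_KEY: external_type,  # original type info, kept verbatim
    })


def decode(f, raw):
    """Box a raw value if the original type is understood, else pass it through."""
    is_ext = f.metadata.get(EXT_NAME_KEY) == EXT_NAME
    if is_ext and f.metadata.get(EXT_META_KEY) == "uuid":
        return uuid.UUID(bytes=raw)
    return raw  # consumers unaware of the extension still get the storage value
```

A consumer that recognizes the extension name can box the 16 bytes into a
uuid.UUID; one that does not still sees an ordinary fixed-size binary field.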

I think it would be good to have standard UUID and JSON extension types. I
don't think anyone is actively working on them.

On Thu, Apr 11, 2024, at 05:55, Wes McKinney wrote:
> In the past we have discussed adding a canonical type for UUID and JSON. I
> still think this is a good idea and could improve ergonomics in downstream
> language bindings (e.g. by exposing JSON querying functions or automatically
> boxing UUIDs in built-in UUID types, like the Python uuid library). Has
> anyone done any work on this to anyone's knowledge?
>
> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Norman,
>> Arrow has a concept of extension types [1] along with the possibility of
>> proposing new canonical extension types [2].  This seems to cover the
>> use-cases you mention but I might be misunderstanding?
>>
>> Thanks,
>> Micah
>>
>> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>
>> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>> <norman.jor...@improving.com.invalid> wrote:
>>
>> > Problem Description
>> >
>> > Currently Arrow schemas can only contain columns of types supported by
>> > Arrow. In some cases an Arrow schema maps to an external schema. This can
>> > result in the Arrow schema not being able to support all the columns from
>> > the external schema.
>> >
>> > Consider an external system that contains a column of type UUID. To model
>> > the schema in Arrow, the user has two choices:
>> >
>> >   1.  Do not include the UUID column in the Arrow schema
>> >
>> >   2.  Map the column to an existing Arrow type. This will not include the
>> > original type information. A UUID can be mapped to a FixedSizeBinary, but
>> > consumers of the Arrow schema will be unable to distinguish a
>> > FixedSizeBinary field from a UUID field.
>> >
>> > Possible Solution
>> >
>> >   *   Add a new type code that represents unsupported types
>> >
>> >   *   Values for the new type are represented as variable length binary
>> >
>> > Some drivers can expose data even when they don’t understand the data
>> > type. For example, the PostgreSQL driver will return the raw bytes for
>> > fields of an unknown type. Using an explicit type lets clients know that
>> > they should convert the values if they are able to determine the actual
>> > data type.
>> >
>> > Questions
>> >
>> >   *   What is the impact on existing clients when they encounter fields of
>> > the unsupported type?
>> >
>> >   *   Is it safe to assume that all unsupported values can safely be
>> > converted to a variable length binary?
>> >
>> >   *   How can we preserve information about the original type?
>> >
>> >
>>
