I think this should be an extension type, yes. It could be parametrized on the storage type, since the other system might at least know that one type is based on another (e.g. a user-defined type). Metadata about the original type can be preserved in the extension type's metadata.
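To make that concrete, here is a minimal sketch in plain Python (standard library only; it is not the pyarrow API). The only parts taken from the Arrow columnar format are the two reserved field-metadata keys for extension types, `ARROW:extension:name` and `ARROW:extension:metadata`. Everything else is a hypothetical illustration: the toy `Field` class, the `example.opaque` extension name, the `vendor_name`/`type_name` metadata keys, and the storage type strings.

```python
import json
import uuid
from dataclasses import dataclass, field

# Reserved custom-metadata keys that the Arrow columnar format uses to mark
# a field as an extension type.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical extension name, made up for this sketch.
OPAQUE_EXT_NAME = "example.opaque"


@dataclass
class Field:
    """Toy stand-in for an Arrow field: name, storage type, custom metadata."""
    name: str
    storage_type: str
    metadata: dict = field(default_factory=dict)


def opaque_field(name, storage_type, vendor_name, type_name):
    """Describe a column whose external type has no native Arrow equivalent.

    The extension is parametrized on the storage type, and the original type
    information is serialized into the extension metadata (assumed JSON).
    """
    ext_meta = json.dumps({"vendor_name": vendor_name, "type_name": type_name})
    return Field(
        name,
        storage_type,
        {EXT_NAME_KEY: OPAQUE_EXT_NAME, EXT_META_KEY: ext_meta},
    )


def original_type(f):
    """Return the external type name, or None if the field is not tagged."""
    if f.metadata.get(EXT_NAME_KEY) != OPAQUE_EXT_NAME:
        return None
    return json.loads(f.metadata[EXT_META_KEY])["type_name"]


def box_value(f, raw):
    """Convert raw storage bytes to a richer value when the type is known."""
    if original_type(f) == "uuid" and f.storage_type == "fixed_size_binary[16]":
        return uuid.UUID(bytes=raw)
    return raw  # untagged or unrecognized type: pass the bytes through


raw = uuid.UUID("12345678-1234-5678-1234-567812345678").bytes
plain = Field("id", "fixed_size_binary[16]")
tagged = opaque_field("id", "fixed_size_binary[16]", "postgresql", "uuid")
print(box_value(plain, raw))   # raw bytes, since nothing marks this as a UUID
print(box_value(tagged, raw))  # a uuid.UUID, recovered from the metadata
```

A consumer that does not know the extension name can still read the column through its storage type, while one that does can recover the original type from the metadata.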
I think it would be good to have standard UUID and JSON extension types. I don't think anyone is actively working on it.

On Thu, Apr 11, 2024, at 05:55, Wes McKinney wrote:
> In the past we have discussed adding a canonical type for UUID and JSON. I
> still think this is a good idea and could improve ergonomics in downstream
> language bindings (e.g. by exposing JSON querying function or automatically
> boxing UUIDs in built-in UUID types, like the Python uuid library). Has
> anyone done any work on this to anyone's knowledge?
>
> On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Norman,
>> Arrow has a concept of extension types [1] along with the possibility of
>> proposing new canonical extension types [2]. This seems to cover the
>> use-cases you mention but I might be misunderstanding?
>>
>> Thanks,
>> Micah
>>
>> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
>> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
>>
>> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
>> <norman.jor...@improving.com.invalid> wrote:
>>
>> > Problem Description
>> >
>> > Currently Arrow schemas can only contain columns of types supported by
>> > Arrow. In some cases an Arrow schema maps to an external schema. This can
>> > result in the Arrow schema not being able to support all the columns from
>> > the external schema.
>> >
>> > Consider an external system that contains a column of type UUID. To model
>> > the schema in Arrow, the user has two choices:
>> >
>> > 1. Do not include the UUID column in the Arrow schema
>> >
>> > 2. Map the column to an existing Arrow type. This will not include the
>> > original type information. A UUID can be mapped to a FixedSizeBinary, but
>> > consumers of the Arrow schema will be unable to distinguish a
>> > FixedSizeBinary field from a UUID field.
>> >
>> > Possible Solution
>> >
>> > * Add a new type code that represents unsupported types
>> >
>> > * Values for the new type are represented as variable length binary
>> >
>> > Some drivers can expose data even when they don’t understand the data
>> > type. For example, the PostgreSQL driver will return the raw bytes for
>> > fields of an unknown type. Using an explicit type lets clients know that
>> > they should convert values if they were able to determine the actual data
>> > type.
>> >
>> > Questions
>> >
>> > * What is the impact on existing clients when they encounter fields of
>> > the unsupported type?
>> >
>> > * Is it safe to assume that all unsupported values can safely be
>> > converted to a variable length binary?
>> >
>> > * How can we preserve information about the original type?