Depending on where your Arrow-encoded data is used, either extension types or generic field metadata are options. We have this problem in the ADBC Postgres driver, where we can convert *most* Postgres types to an Arrow type, but for some types we can't, don't know how to, or haven't implemented a conversion. Currently, for these we return opaque binary (the Postgres COPY representation of the value) and attach field metadata so that a consumer can implement a workaround for an unsupported type. It would arguably have been better to implement this as an extension type; however, field metadata felt like less of a commitment when I first worked on this.

Cheers,

-dewey

On Thu, Apr 11, 2024 at 1:20 PM Norman Jordan <norman.jor...@improving.com.invalid> wrote:
>
> I was using UUID as an example. It looks like extension types covers my
> original request.
> ________________________________
> From: Felipe Oliveira Carvalho <felipe...@gmail.com>
> Sent: Thursday, April 11, 2024 7:15 AM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: Re: Unsupported/Other Type
>
> The OP used UUID as an example. Would that be enough or the request is for
> a flexible mechanism that allows the creation of one-off nominal types for
> very specific use-cases?
>
> --
> Felipe
>
> On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Yes, JSON and UUID are obvious candidates for new canonical extension
> > types. XML also comes to mind, but I'm not sure there's much of a use
> > case for it.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 10/04/2024 at 22:55, Wes McKinney wrote:
> > > In the past we have discussed adding a canonical type for UUID and JSON. I
> > > still think this is a good idea and could improve ergonomics in downstream
> > > language bindings (e.g. by exposing JSON querying function or automatically
> > > boxing UUIDs in built-in UUID types, like the Python uuid library). Has
> > > anyone done any work on this to anyone's knowledge?
> > >
> > > On Wed, Apr 10, 2024 at 3:05 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > >> Hi Norman,
> > >> Arrow has a concept of extension types [1] along with the possibility of
> > >> proposing new canonical extension types [2]. This seems to cover the
> > >> use-cases you mention but I might be misunderstanding?
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1] https://arrow.apache.org/docs/format/Columnar.html#format-metadata-extension-types
> > >> [2] https://arrow.apache.org/docs/format/CanonicalExtensions.html
> > >>
> > >> On Wed, Apr 10, 2024 at 11:44 AM Norman Jordan
> > >> <norman.jor...@improving.com.invalid> wrote:
> > >>
> > >>> Problem Description
> > >>>
> > >>> Currently Arrow schemas can only contain columns of types supported by
> > >>> Arrow. In some cases an Arrow schema maps to an external schema. This can
> > >>> result in the Arrow schema not being able to support all the columns from
> > >>> the external schema.
> > >>>
> > >>> Consider an external system that contains a column of type UUID. To model
> > >>> the schema in Arrow, the user has two choices:
> > >>>
> > >>> 1. Do not include the UUID column in the Arrow schema
> > >>>
> > >>> 2. Map the column to an existing Arrow type. This will not include the
> > >>> original type information. A UUID can be mapped to a FixedSizeBinary, but
> > >>> consumers of the Arrow schema will be unable to distinguish a
> > >>> FixedSizeBinary field from a UUID field.
> > >>>
> > >>> Possible Solution
> > >>>
> > >>> * Add a new type code that represents unsupported types
> > >>>
> > >>> * Values for the new type are represented as variable length binary
> > >>>
> > >>> Some drivers can expose data even when they don’t understand the data
> > >>> type. For example, the PostgreSQL driver will return the raw bytes for
> > >>> fields of an unknown type. Using an explicit type lets clients know that
> > >>> they should convert values if they were able to determine the actual data
> > >>> type.
> > >>>
> > >>> Questions
> > >>>
> > >>> * What is the impact on existing clients when they encounter fields of
> > >>> the unsupported type?
> > >>>
> > >>> * Is it safe to assume that all unsupported values can safely be
> > >>> converted to a variable length binary?
> > >>>
> > >>> * How can we preserve information about the original type?