> Sidenote: I haven't seen many proposals for canonical extension types
> so far, which is a bit surprising. The barrier for standardizing a
> canonical extension type is much lower than for a new Arrow data type.

Chiming in here, since I've done some exploration implementing bfloat16 as
an extension type. The reason I haven't proposed a canonical extension type
is that I've found the ecosystem I work in isn't yet ready for bfloat16 to
really be implemented as an extension type. To give a few examples:

* In PyArrow, the ChunkedArray repr can't be customized by an extension
type, so users always see the raw storage values. That makes for a poor
user experience when the storage type is fixed-size binary but the data
means something other than binary. See the example in [1] and the first
sketch after this list.
* In arrow-rs / DataFusion, there isn't any explicit support for extension
types; they only exist as field metadata right now (see the second sketch
after this list). There has been some discussion of adding ExtensionType
to the DataType enum, but that was judged an unacceptably large refactor
for now [2]. There is instead an ongoing discussion about supporting
extension types in DataFusion [3].
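
To make the first point concrete, here is a minimal sketch of the repr
problem. The extension name "my.bfloat16" and the example values (the
little-endian bfloat16 encodings of 1.0 and 2.0) are just illustrative:

```python
import pyarrow as pa


# A bfloat16 extension type backed by two-byte fixed-size binary storage.
class BFloat16Type(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.binary(2), "my.bfloat16")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


storage = pa.array([b"\x80\x3f", b"\x00\x40"], pa.binary(2))
arr = pa.ExtensionArray.from_storage(BFloat16Type(), storage)

# Putting the array in a table yields a ChunkedArray, whose repr falls
# back to the raw storage bytes instead of showing 1.0 and 2.0.
print(pa.table({"x": arr}).column("x"))
```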

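On the second point: at the IPC level an extension type is just its
storage type plus two standard metadata keys on the field, and that
field-level metadata is all a consumer without explicit extension support
(arrow-rs today) gets to see. A sketch of that wire-level view, reusing
the placeholder name from above:

```python
import pyarrow as pa

# What an extension field looks like to a consumer that only understands
# field metadata: a plain storage field carrying the two metadata keys
# defined by the Arrow format spec.
field = pa.field(
    "x",
    pa.binary(2),  # the storage type
    metadata={
        "ARROW:extension:name": "my.bfloat16",
        "ARROW:extension:metadata": "",
    },
)
print(field.metadata)
```
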
So, as I see it, the practical barrier to extension types isn't just
defining canonical ones but also improving the ecosystem to support them
better. I don't think these issues are insurmountable; they will just take
a little more work. In fact, if we put the effort into getting just one
extension type like this well supported across the Arrow ecosystem, I
think the path for additional extension types would become rather easy. I
was hoping to make some progress on this myself, but I have had to focus
elsewhere. For now, if you really care about the user experience being
good, getting a data type formally supported still seems like an easier
path than an extension type.

Best,

Will Jones


[1] https://github.com/apache/arrow/issues/36648
[2] https://github.com/apache/arrow-rs/issues/4472
[3] https://github.com/apache/arrow-datafusion/issues/7923

On Thu, Nov 9, 2023 at 9:39 AM Curt Hagenlocher <c...@hagenlocher.org>
wrote:

> It certainly could be. Would float16 be done as a canonical extension type
> if it were proposed today?
>
> On Thu, Nov 9, 2023 at 9:36 AM David Li <lidav...@apache.org> wrote:
>
> > cuDF has decimal32/decimal64 [1].
> >
> > Would a canonical extension type [2] be appropriate here? I think that's
> > come up as a solution before.
> >
> > [1]: https://docs.rapids.ai/api/cudf/stable/user_guide/data-types/
> > [2]: https://arrow.apache.org/docs/format/CanonicalExtensions.html
> >
> > On Thu, Nov 9, 2023, at 11:56, Antoine Pitrou wrote:
> > > Or they could trivially use an int64 column for that, since the scale is
> > > fixed anyway, and you're probably not going to multiply money values
> > > together.
> > >
> > >
> > > On 09/11/2023 at 17:54, Curt Hagenlocher wrote:
> > >> If Arrow had a decimal64 type, someone could choose to use that for a
> > >> PostgreSQL money column knowing that there are edge cases where they
> > >> may get an undesired result.
> > >>
> > >> On Thu, Nov 9, 2023 at 8:42 AM Antoine Pitrou <anto...@python.org>
> > >> wrote:
> > >>
> > >>>
> > >>> On 09/11/2023 at 17:23, Curt Hagenlocher wrote:
> > >>>> Or more succinctly, "111,111,111,111,111.1111" will fit into a
> > >>>> decimal64; would you prevent it from being stored in one so that you
> > >>>> can describe the column as "decimal(18, 4)"?
> > >>>
> > >>> That's what we do for other decimal types, see PyArrow below:
> > >>> ```
> > >>>   >>> pa.array([111_111_111_111_111_1111]).cast(pa.decimal128(18, 0))
> > >>> Traceback (most recent call last):
> > >>>     [...]
> > >>> ArrowInvalid: Precision is not great enough for the result. It should
> > >>> be at least 19
> > >>> ```
> > >>>
> > >>>
> > >>
> >
>
