Thank you for opening up the discussion here!

I agree that attaching semantics as metadata and/or documenting them
in a central repository is an unreasonable burden to put on extension
type authors and Arrow implementations in general.

I also agree that operations other than filter/take/concatenate should
error by default: just because a storage type happens to be an
integer doesn't necessarily mean that arithmetic is meaningful. For
example, for an extension type wrapping a bit-packed uint64 such as an
S2 cell or H3 index, "plus one" or "times three" would produce an
invalid value.
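
To make the arithmetic point concrete, here is a minimal Python sketch
using a made-up bit-packed layout. It is NOT the real S2 or H3 encoding:
the field widths, the "7 means unused" rule, and the names pack/is_valid
are all assumptions for illustration only.

```python
# Toy layout (hypothetical): a 4-bit level followed by 15 three-bit digit
# slots. Slots below the level hold a digit 0..6; unused slots must be 7.
DIGIT_BITS = 3
MAX_LEVEL = 15


def pack(digits):
    """Pack a list of digits (each 0..6) into a single integer."""
    level = len(digits)
    value = level
    for i in range(MAX_LEVEL):
        digit = digits[i] if i < level else 7
        value = (value << DIGIT_BITS) | digit
    return value


def is_valid(value):
    """Check that every slot is consistent with the level field."""
    level = value >> (MAX_LEVEL * DIGIT_BITS)
    if level > MAX_LEVEL:
        return False
    for i in range(MAX_LEVEL):
        shift = (MAX_LEVEL - 1 - i) * DIGIT_BITS
        digit = (value >> shift) & 0b111
        if i < level and digit == 7:
            return False  # a used slot holds the "unused" marker
        if i >= level and digit != 7:
            return False  # an unused slot holds a real digit
    return True


cell = pack([1, 2, 3])
print(is_valid(cell))       # the packed value itself is well formed
print(is_valid(cell + 1))   # integer arithmetic ignores the layout
print(is_valid(cell * 3))
print(cell == pack([1, 2, 3]), hash(cell) == hash(pack([1, 2, 3])))
```

In this toy layout the integer add and multiply both yield values that
fail validation, while == and hash on the storage remain meaningful.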

For query engines and/or implementations with extensive compute
capability like Arrow C++, it is useful to be able to leverage that
capability for extension types: for the S2/H3 index example, it would
be very cool to allow a group_by + aggregate to "just work" (since
==/hash *is* valid for this example), although I don't imagine it's a
development priority for anybody right now. I agree with Antoine that
implementations should be able to choose how/if extension type authors
can leverage other capabilities of the engine.
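
As a rough illustration of the opt-in approach Ben describes below (a
runtime registry of allowed implicit casts, with an error message that
suggests registration), something like the following might work. This
is only a sketch: register_implicit_cast, check_operation, and the type
name "example.h3" are hypothetical and not part of any Arrow API.

```python
# Hypothetical registry mapping an extension type name to the set of
# operations that may implicitly run on its storage.
_ALLOWED = {}

# Selection-style operations are assumed to be safe for every type.
ALWAYS_ALLOWED = {"filter", "take", "concatenate"}


def register_implicit_cast(ext_name, operation):
    """Opt an extension type in to one operation on its storage."""
    _ALLOWED.setdefault(ext_name, set()).add(operation)


def check_operation(ext_name, operation):
    """Raise unless the operation is allowed for this extension type."""
    if operation in ALWAYS_ALLOWED:
        return
    if operation in _ALLOWED.get(ext_name, set()):
        return
    raise TypeError(
        f"'{operation}' is not allowed on extension type '{ext_name}'; "
        f"call register_implicit_cast('{ext_name}', '{operation}') to opt in"
    )


# An index-like type opts in to equality and hashing only.
register_implicit_cast("example.h3", "equal")
register_implicit_cast("example.h3", "hash")

check_operation("example.h3", "take")   # allowed by default
check_operation("example.h3", "hash")   # allowed by registration

try:
    check_operation("example.h3", "add")
except TypeError as err:
    message = str(err)
    print(message)
```

With equal/hash registered, a group_by + aggregate could proceed on the
storage while arithmetic still errors with a hint on how to opt in.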

If this is pursued further, it might be worth checking out a
particularly successful extensible vector system implemented in R via
the vctrs package ( https://vctrs.r-lib.org/ ). "vector" class authors
can implement one or more S3 methods (i.e., traits):

- vec_proxy(x) (get me the storage array)
- vec_ptype2(type1, type2) (given two types, get me a type that I can
  cast both to or error)
- vec_cast(x, type) (perform a lossless cast to type or error)
- vec_proxy_equal(x) (get me a storage array where == does the right thing)
- vec_proxy_order(x) (get me a storage array that sorts in the correct order)
- vec_math(op, x) (perform unary math ops like sum)
- vec_arith(op, lhs, rhs) (perform binary math ops like +, -, etc.)
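
To show the shape of the idea outside of R, here is a loose Python
sketch of proxy-style methods like the ones listed above. It does not
reproduce the vctrs API: the classes ExtensionVector, VersionVector and
OpaqueIdVector and the functions sort_indices and group_counts are
invented for illustration, and only the equality and ordering proxies
are modelled.

```python
from collections import Counter


class ExtensionVector:
    """Hypothetical base class: every capability beyond storage is opt-in."""

    def __init__(self, storage):
        self.storage = list(storage)

    def proxy(self):
        # Analogous in spirit to vec_proxy(): hand back the storage.
        return self.storage

    def proxy_equal(self):
        raise TypeError(f"{type(self).__name__} does not support ==/hash")

    def proxy_order(self):
        raise TypeError(f"{type(self).__name__} does not support ordering")


class VersionVector(ExtensionVector):
    """Version strings such as "1.10.0"; storage order is not version order."""

    def proxy_equal(self):
        return self.storage

    def proxy_order(self):
        # Compare numerically, component by component.
        return [tuple(int(p) for p in s.split(".")) for s in self.storage]


class OpaqueIdVector(ExtensionVector):
    """Bit-packed ids: equality is meaningful, ordering is not."""

    def proxy_equal(self):
        return self.storage


def sort_indices(vec):
    """Engine-side sort that only ever looks at the ordering proxy."""
    keys = vec.proxy_order()
    return sorted(range(len(keys)), key=keys.__getitem__)


def group_counts(vec):
    """Engine-side group-by count that only looks at the equality proxy."""
    return Counter(vec.proxy_equal())


versions = VersionVector(["1.10.0", "1.2.0", "1.9.1"])
ids = OpaqueIdVector([7, 7, 42])

print(sort_indices(versions))  # → [1, 2, 0], not the string order [0, 1, 2]
print(group_counts(ids))       # grouping works through the equality proxy

try:
    sort_indices(ids)
except TypeError as err:
    print(err)                 # ordering was never opted in to
```

The engine functions never touch the storage directly, so a type
author decides which operations are meaningful by choosing which
proxies to implement.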

Cheers!

-dewey

On Wed, Dec 13, 2023 at 12:39 PM Benjamin Kietzman <bengil...@gmail.com> wrote:
>
> The main problem I see with adding properties to ExtensionType is I'm not
> sure where that information would reside. Allowing type authors to declare
> in which ways the type is equivalent (or not) to its storage is appealing,
> but it seems to need an official extension field like
> ARROW:extension:semantics. Otherwise I think each extension type's
> semantics would need to be maintained within every implementation as well
> as in a central reference (probably in Columnar.rst), which seems
> unreasonable to expect of extension type authors. I'm also skeptical that
> useful information could be packed into an ARROW:extension:semantics field;
> even if the type can declare that ordering-as-with-storage is invalid I
> don't think it'd be feasible to specify the correct ordering.
>
> If we cannot attach this information to extension types, the question
> becomes which defaults are most reasonable for engines and how can the
> engine most usefully be configured outside those defaults. My own
> preference would be to refuse operations other than selection or
> casting-to-storage, with a runtime extensible registry of allowed implicit
> casts. This will allow users of the engine to configure their extension
> types as they need, and the error message raised when an implicit
> cast-to-storage is not allowed could include the suggestion to register the
> implicit cast. For applications built against a specific engine, this
> approach would allow recovering much of the advantage of attaching
> properties to an ExtensionType by including registration of implicit casts
> in the ExtensionType's initialization.
>
> On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman <bengil...@gmail.com>
> wrote:
>
> > Hello all,
> >
> > Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
> > any extension type to its storage type in acero. This raises questions
> > about the validity of applying operations to an extension array's storage.
> > For example, some extension type authors may intend different ordering for
> > arrays of their new type than would be applied to the array's storage or
> > may not intend for the type to participate in arithmetic even though its
> > storage could.
> >
> > Suggestions/observations from discussion on that PR included:
> > - Extension types could provide general semantic description of storage
> > type equivalence [2], so that a flag on the extension type enables ordering
> > by storage but disables arithmetic on it
> > - Compute functions or kernels could be augmented with a filter declaring
> > which extension types are supported [3].
> > - Currently arrow-rs considers extension types metadata only [4], so all
> > kernels treat extension arrays equivalently to their storage.
> > - Currently arrow c++ only supports explicitly casting from an extension
> > type to its storage (and from storage to ext), so any operation can be
> > performed on an extension array's storage but it requires opting in.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] https://github.com/apache/arrow/pull/39200
> > [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
> > [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
> > [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
> >
