I agree engines can use their own strategy.  Requiring explicit casts is
probably OK as long as it is well documented, but I lean slightly
towards implicitly falling back to the storage type.  I do think
people still shy away from extension types.  Adding the extension type to
an implicit cast registry is another hurdle to their use, albeit a small
one.

Substrait has a similar consideration for extension types.  They can be
declared "inherits" (meaning the storage type can be used implicitly in
compute functions) or "separate" (meaning the storage type cannot be used
implicitly in compute functions).  This would map nicely to an Arrow
metadata field.
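
As a rough sketch of what that mapping could look like (plain Python, no
Arrow library involved): `ARROW:extension:name` is the existing Arrow
metadata key, while `ARROW:extension:semantics` is only the hypothetical
key floated elsewhere in this thread, and the type name `example.uuid` and
both helper functions are made up for illustration.

```python
from enum import Enum

EXTENSION_NAME_KEY = "ARROW:extension:name"    # existing Arrow metadata key
SEMANTICS_KEY = "ARROW:extension:semantics"    # hypothetical, not in the spec


class StorageSemantics(Enum):
    INHERITS = "inherits"   # storage may be used implicitly in compute functions
    SEPARATE = "separate"   # storage may only be used after an explicit cast


def storage_semantics(metadata, default=StorageSemantics.SEPARATE):
    """Read the (hypothetical) semantics flag from a field's metadata."""
    raw = metadata.get(SEMANTICS_KEY)
    return default if raw is None else StorageSemantics(raw)


def storage_usable_implicitly(metadata):
    """True if an engine honoring the flag could use the storage implicitly."""
    if EXTENSION_NAME_KEY not in metadata:
        return True  # not an extension type, so there is nothing to decide
    return storage_semantics(metadata) is StorageSemantics.INHERITS


uuid_metadata = {EXTENSION_NAME_KEY: "example.uuid", SEMANTICS_KEY: "inherits"}
print(storage_usable_implicitly(uuid_metadata))
```

The default of `SEPARATE` for an unflagged extension type is an arbitrary
choice here; an engine leaning towards implicit fallback would flip it.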

Unfortunately, I think the truth is more nuanced than a simple
separate/inherits flag.  Take UUID for example (everyone's favorite fixed
size binary extension type).  We would definitely want to implicitly reuse
the hash, equality, and sorting functions.

However, for other functions it gets trickier.  Imagine you have a
`replace_slice` function.  Should it return a new UUID (change some bytes
in a UUID and you have a new UUID) or not (once you start changing bytes in
a UUID you no longer have a UUID)?  Or what if there was a `slice`
function?  This function should either be prohibited (you can't slice a
UUID) or should return a fixed size binary string (you can still slice it
but you no longer have a UUID).
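
To make the UUID case concrete, here is a minimal sketch of the kind of
per-function table a single flag cannot express.  Everything in it (the
`Policy` enum, `UUID_POLICIES`, `resolve`) is hypothetical, and the entries
for `slice` and `replace_slice` are just one of the two defensible answers
described above.

```python
from enum import Enum


class Policy(Enum):
    REUSE_STORAGE = "run the storage kernel; the output type is unaffected"
    DROP_EXTENSION = "run on storage; the result is plain storage"
    KEEP_EXTENSION = "run on storage; re-wrap the result as the extension type"
    REJECT = "refuse unless the user casts explicitly"


# Hypothetical per-function decisions for a UUID extension type.
UUID_POLICIES = {
    "hash": Policy.REUSE_STORAGE,
    "equal": Policy.REUSE_STORAGE,
    "sort": Policy.REUSE_STORAGE,
    "slice": Policy.DROP_EXTENSION,   # could equally be REJECT
    "replace_slice": Policy.REJECT,   # could equally be KEEP_EXTENSION
}


def resolve(function_name, policies):
    """Look up a function's policy, refusing anything not listed."""
    return policies.get(function_name, Policy.REJECT)


for name in ("hash", "slice", "replace_slice", "add"):
    print(name, "->", resolve(name, UUID_POLICIES).name)
```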

Given the complication I think users will always need to carefully consider
each function they apply to an extension type no matter how smart a system
is.  I'm not convinced any metadata exists that could define the right
approach consistently across cases.  This means our choice is whether we
force users to explicitly declare each such decision or we just trust that
they are doing the proper consideration when they design their plan.  I'm
not sure there is a right answer.  One can point to the vast diversity of
ways that programming languages have approached implicit vs explicit
integer casts.

My last concern is that we rely on compute functions in operators other
than project/filter.  For example, to use a column as a key for a hash-join
we need to be able to compute the hash value and calculate equality.  To
use a column as a key for sorting we need an ordering function.  These are
places where it might be unexpected for users to insert explicit casts.  An
engine would need to make sure the error message in these cases was very
clear.
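
For illustration, a sketch of the sort of check and message I have in
mind.  The function, the exception class, the operator names and the
`implicit_ok` allow-list are all hypothetical and not taken from any
existing engine.

```python
class ImplicitStorageCastError(TypeError):
    """Raised when an operator would need to fall back to storage implicitly."""


def check_key_column(operator, column, extension_name, storage_type,
                     implicit_ok=frozenset()):
    """Validate a join/sort key column before the operator is built.

    extension_name is None for plain (non-extension) columns.
    implicit_ok is a hypothetical allow-list of extension type names that
    the user has registered as safe to treat as their storage.
    """
    if extension_name is None or extension_name in implicit_ok:
        return
    raise ImplicitStorageCastError(
        f"{operator}: key column '{column}' has extension type "
        f"'{extension_name}' (storage: {storage_type}). This engine does not "
        f"implicitly hash, compare, or order extension types by their storage. "
        f"Cast '{column}' to {storage_type} before the {operator}, or register "
        f"'{extension_name}' as implicitly castable to storage."
    )


try:
    check_key_column("hash_join", "id", "example.uuid", "fixed_size_binary(16)")
except ImplicitStorageCastError as exc:
    print(exc)
```

The point is that the message names the operator and the column, since the
user never wrote a call to a hash or ordering function themselves.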

On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hi,
>
> For now, I would suggest that each implementation decides on their own
> strategy, because we don't have a clear idea of which is better (and
> extension types are probably not getting a lot of use yet).
>
> Regards
>
> Antoine.
>
>
> Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit :
> > The main problem I see with adding properties to ExtensionType is I'm not
> > sure where that information would reside. Allowing type authors to declare
> > in which ways the type is equivalent (or not) to its storage is appealing,
> > but it seems to need an official extension field like
> > ARROW:extension:semantics. Otherwise I think each extension type's
> > semantics would need to be maintained within every implementation as well
> > as in a central reference (probably in Columnar.rst), which seems
> > unreasonable to expect of extension type authors. I'm also skeptical that
> > useful information could be packed into an ARROW:extension:semantics field;
> > even if the type can declare that ordering-as-with-storage is invalid I
> > don't think it'd be feasible to specify the correct ordering.
> >
> > If we cannot attach this information to extension types, the question
> > becomes which defaults are most reasonable for engines and how can the
> > engine most usefully be configured outside those defaults. My own
> > preference would be to refuse operations other than selection or
> > casting-to-storage, with a runtime extensible registry of allowed implicit
> > casts. This will allow users of the engine to configure their extension
> > types as they need, and the error message raised when an implicit
> > cast-to-storage is not allowed could include the suggestion to register the
> > implicit cast. For applications built against a specific engine, this
> > approach would allow recovering much of the advantage of attaching
> > properties to an ExtensionType by including registration of implicit casts
> > in the ExtensionType's initialization.
> >
> > On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman <bengil...@gmail.com> wrote:
> >
> >> Hello all,
> >>
> >> Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
> >> any extension type to its storage type in acero. This raises questions
> >> about the validity of applying operations to an extension array's storage.
> >> For example, some extension type authors may intend different ordering for
> >> arrays of their new type than would be applied to the array's storage or
> >> may not intend for the type to participate in arithmetic even though its
> >> storage could.
> >>
> >> Suggestions/observations from discussion on that PR included:
> >> - Extension types could provide general semantic description of storage
> >> type equivalence [2], so that a flag on the extension type enables ordering
> >> by storage but disables arithmetic on it
> >> - Compute functions or kernels could be augmented with a filter declaring
> >> which extension types are supported [3].
> >> - Currently arrow-rs considers extension types metadata only [4], so all
> >> kernels treat extension arrays equivalently to their storage.
> >> - Currently arrow c++ only supports explicitly casting from an extension
> >> type to its storage (and from storage to ext), so any operation can be
> >> performed on an extension array's storage but it requires opting in.
> >>
> >> Sincerely,
> >> Ben Kietzman
> >>
> >> [1] https://github.com/apache/arrow/pull/39200
> >> [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
> >> [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
> >> [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
> >>
> >
>
