I'm in favor of Antoine's proposal of storage equivalence traits[1]. For
the sake of clarity I'll paste it here:

> I would suggest we perhaps need a more general semantic description of
> storage type equivalence.
> Draft:
> class ExtensionType {
> public:
> // Storage equivalence for equality testing and hashing
> static constexpr uint32_t kEquality = 1;
> // Storage equivalence for ordered comparisons
> static constexpr uint32_t kOrdering = 2;
> // Storage equivalence for selections (filter, take, etc.)
> static constexpr uint32_t kSelection = 4;
> // Storage equivalence for arithmetic
> static constexpr uint32_t kArithmetic = 8;
> // Storage equivalence for explicit casts
> static constexpr uint32_t kCasting = 16;
> // Storage equivalence for all operations
> static constexpr uint32_t kAny = std::numeric_limits<uint32_t>::max();
> // By default, an extension type can be implicitly handled as its storage
> // type for selections, equality testing and hashing.
> virtual uint32_t storage_equivalence() const {
>   return kEquality | kSelection;
> }
>

I think this is well balanced between convenience and safety. The default
option ensures the "normal" operations like take, group-by, unique... just
work, and extension type authors can easily opt into additional functions.

It also requires minimal engineering effort. Each function only needs to
specify what traits it requires, rather than the actual types.
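
To make that concrete, here is a rough sketch of how a function could
declare its required traits and how an engine could check them. It is only
an illustration under my own assumptions: ExtensionTypeSketch,
UuidTypeSketch, FunctionSketch and CanUseStorageImplicitly are hypothetical
names, not existing Arrow C++ APIs, and the flag values are copied from the
draft above.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in carrying the trait flags from the quoted draft.
// This is NOT the real arrow::ExtensionType.
struct ExtensionTypeSketch {
  static constexpr uint32_t kEquality = 1;
  static constexpr uint32_t kOrdering = 2;
  static constexpr uint32_t kSelection = 4;
  static constexpr uint32_t kArithmetic = 8;
  static constexpr uint32_t kCasting = 16;
  static constexpr uint32_t kAny = std::numeric_limits<uint32_t>::max();

  virtual ~ExtensionTypeSketch() = default;

  // Default from the draft: selections, equality testing and hashing.
  virtual uint32_t storage_equivalence() const {
    return kEquality | kSelection;
  }
};

// Hypothetical UUID-like type whose author also opts into ordering.
struct UuidTypeSketch : ExtensionTypeSketch {
  uint32_t storage_equivalence() const override {
    return kEquality | kSelection | kOrdering;
  }
};

// Hypothetical per-function declaration: a trait mask, not a list of types.
struct FunctionSketch {
  const char* name;
  uint32_t required_traits;
};

// True only if the type grants every trait the function requires.
bool CanUseStorageImplicitly(const ExtensionTypeSketch& type,
                             const FunctionSketch& fn) {
  return (type.storage_equivalence() & fn.required_traits) ==
         fn.required_traits;
}
```

With this, a function requiring kOrdering would be rejected for a type that
keeps the default traits but accepted for the UUID-like type, and neither
type would be usable with a function requiring kArithmetic.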

BTW I've checked every C++ compute function and I think the only traits
missing here are one for string operations, and one for generation such as
`random`.

[1]  https://github.com/apache/arrow/pull/39200#issuecomment-1852307954

Best,
Jin

On Thu, Dec 14, 2023 at 10:04 PM Weston Pace <weston.p...@gmail.com> wrote:

> I agree engines can use their own strategy.  Requiring explicit casts is
> probably ok as long as it is well documented but I think I lean slightly
> towards implicitly falling back to the storage type.  I do think
> people still shy away from extension types.  Adding the extension type to
> an implicit cast registry is another hurdle to their use, albeit a small
> one.
>
> Substrait has a similar consideration for extension types.  They can be
> declared "inherits" (meaning the storage type can be used implicitly in
> compute functions) or "separate" (meaning the storage type cannot be used
> implicitly in compute functions).  This would map nicely to an Arrow
> metadata field.
>
> Unfortunately, I think the truth is more nuanced than a simple
> separate/inherits flag.  Take UUID for example (everyone's favorite fixed
> size binary extension type).  We would definitely want to implicitly reuse
> the hash, equality, and sorting functions.
>
> However, for other functions it gets trickier.  Imagine you have a
> `replace_slice` function.  Should it return a new UUID (change some bytes
> in a UUID and you have a new UUID) or not (once you start changing bytes in
> a UUID you no longer have a UUID).  Or what if there was a `slice`
> function.  This function should either be prohibited (you can't slice a
> UUID) or should return a fixed size binary string (you can still slice it
> but you no longer have a UUID).
>
> Given the complication I think users will always need to carefully consider
> each use of an extension function no matter how smart a system is.  I'm not
> convinced any metadata exists that could define the right approach in a
> consistent number of cases.  This means our choice is whether we force
> users to explicitly declare each such decision or we just trust that they
> are doing the proper consideration when they design their plan.  I'm not
> sure there is a right answer.  One can point to the vast diversity of ways
> that programming languages have approached implicit vs explicit integer
> casts.
>
> My last concern is that we rely on compute functions in operators other
> than project/filter.  For example, to use a column as a key for a hash-join
> we need to be able to compute the hash value and calculate equality.  To
> use a column as a key for sorting we need an ordering function.  These are
> places where it might be unexpected for users to insert explicit casts.  An
> engine would need to make sure the error message in these cases was very
> clear.
>
> On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou <anto...@python.org>
> wrote:
>
> >
> > Hi,
> >
> > For now, I would suggest that each implementation decides on their own
> > strategy, because we don't have a clear idea of which is better (and
> > extension types are probably not getting a lot of use yet).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit :
> > > The main problem I see with adding properties to ExtensionType is I'm
> > > not sure where that information would reside. Allowing type authors to
> > > declare in which ways the type is equivalent (or not) to its storage is
> > > appealing, but it seems to need an official extension field like
> > > ARROW:extension:semantics. Otherwise I think each extension type's
> > > semantics would need to be maintained within every implementation as
> > > well as in a central reference (probably in Columnar.rst), which seems
> > > unreasonable to expect of extension type authors. I'm also skeptical
> > > that useful information could be packed into an
> > > ARROW:extension:semantics field; even if the type can declare that
> > > ordering-as-with-storage is invalid I don't think it'd be feasible to
> > > specify the correct ordering.
> > >
> > > If we cannot attach this information to extension types, the question
> > > becomes which defaults are most reasonable for engines and how can the
> > > engine most usefully be configured outside those defaults. My own
> > > preference would be to refuse operations other than selection or
> > > casting-to-storage, with a runtime extensible registry of allowed
> > > implicit casts. This will allow users of the engine to configure their
> > > extension types as they need, and the error message raised when an
> > > implicit cast-to-storage is not allowed could include the suggestion to
> > > register the implicit cast. For applications built against a specific
> > > engine, this approach would allow recovering much of the advantage of
> > > attaching properties to an ExtensionType by including registration of
> > > implicit casts in the ExtensionType's initialization.
> > >
> > > On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman
> > > <bengil...@gmail.com> wrote:
> > >
> > >> Hello all,
> > >>
> > >> Recently, a PR to arrow c++ [1] was opened to allow implicit casting
> > >> from any extension type to its storage type in acero. This raises
> > >> questions about the validity of applying operations to an extension
> > >> array's storage. For example, some extension type authors may intend
> > >> different ordering for arrays of their new type than would be applied
> > >> to the array's storage or may not intend for the type to participate
> > >> in arithmetic even though its storage could.
> > >>
> > >> Suggestions/observations from discussion on that PR included:
> > >> - Extension types could provide general semantic description of
> > >> storage type equivalence [2], so that a flag on the extension type
> > >> enables ordering by storage but disables arithmetic on it
> > >> - Compute functions or kernels could be augmented with a filter
> > >> declaring which extension types are supported [3].
> > >> - Currently arrow-rs considers extension types metadata only [4], so
> > >> all kernels treat extension arrays equivalently to their storage.
> > >> - Currently arrow c++ only supports explicitly casting from an
> > >> extension type to its storage (and from storage to ext), so any
> > >> operation can be performed on an extension array's storage but it
> > >> requires opting in.
> > >>
> > >> Sincerely,
> > >> Ben Kietzman
> > >>
> > >> [1] https://github.com/apache/arrow/pull/39200
> > >> [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
> > >> [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
> > >> [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
> > >>
> > >
> >
>
