Re: [DISCUSS] Semantics of extension types

2023-12-15 Thread Dewey Dunnington
I also like these equivalence traits...in addition to being easy for
extension type authors to specify when registering an extension type
in Arrow C++, implementations that allow registration like pyarrow and
arrow/R would be able to specify them easily, whereas implementing
methods, compute functions, or overloads to handle it (e.g., like is
done in vctrs with vec_proxy_equal, which often just returns its
input) would have performance implications (since the methods might
have to be defined in R or Python).

It may also be worth adding a compute function for "force storage" (a
no-op for anything except an extension array), which is maybe safer
than a cast (which implies, I think, some logical equivalence between
the input and the result). That would let a user work around a
situation where the extension type author didn't handle a case that
the user expected to work.

Cheers!

-dewey

On Fri, Dec 15, 2023 at 3:13 AM Jin Shang  wrote:
>
> I'm in favor of Antoine's proposal of storage equivalence traits[1]. For
> the sake of clarity I'll paste it here:
>
> I would suggest we perhaps need a more general semantic description of
> > storage type equivalence.
> > Draft:
> > class ExtensionType {
> > public:
> > // Storage equivalence for equality testing and hashing
> > static constexpr uint32_t kEquality = 1;
> > // Storage equivalence for ordered comparisons
> > static constexpr uint32_t kOrdering = 2;
> > // Storage equivalence for selections (filter, take, etc.)
> > static constexpr uint32_t kSelection = 4;
> > // Storage equivalence for arithmetic
> > static constexpr uint32_t kArithmetic = 8;
> > // Storage equivalence for explicit casts
> > static constexpr uint32_t kCasting = 16;
> > // Storage equivalence for all operations
> > static constexpr uint32_t kAny = std::numeric_limits::max();
> > // By default, an extension type can be implicitly handled as its storage
> > type
> > // for selections, equality testing and hashing.
> > virtual uint32_t storage_equivalence() const { return kEquality |
> > kSelection; }
> >
>
> I think this is well balanced between convenience and safety. The default
> option ensures the "normal" operations like take, group-by, unique... just
> work, and extension type authors can easily opt into additional functions.
>
> It also requires minimum engineering efforts. Each function only needs to
> specify what traits it requires, rather than the actual types.
>
> BTW I've checked every C++ compute function and I think the only traits
> missing here are one for string operations, and one for generation such as
> `random`.
>
> [1]  https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
>
> Best,
> Jin
>
> On Thu, Dec 14, 2023 at 10:04 PM Weston Pace  wrote:
>
> > I agree engines can use their own strategy.  Requiring explicit casts is
> > probably ok as long as it is well documented but I think I lean slightly
> > towards implicitly falling back to the storage type.  I do think think
> > people still shy away from extension types.  Adding the extension type to
> > an implicit cast registry is another hurdle to their use, albeit a small
> > one.
> >
> > Substrait has a similar consideration for extension types.  They can be
> > declared "inherits" (meaning the storage type can be used implicitly in
> > compute functions) or "separate" (meaning the storage type cannot be used
> > implicitly in compute functions).  This would map nicely to an Arrow
> > metadata field.
> >
> > Unfortunately, I think the truth is more nuanced than a simple
> > separate/inherits flag.  Take UUID for example (everyone's favorite fixed
> > size binary extension type).  We would definitely want to implicitly reuse
> > the hash, equality, and sorting functions.
> >
> > However, for other functions it gets trickier.  Imagine you have a
> > `replace_slice` function.  Should it return a new UUID (change some bytes
> > in a UUID and you have a new UUID) or not (once you start changing bytes in
> > a UUID you no longer have a UUID).  Or what if there was a `slice`
> > function.  This function should either be prohibited (you can't slice a
> > UUID) or should return a fixed size binary string (you can still slice it
> > but you no longer have a UUID).
> >
> > Given the complication I think users will always need to carefully consider
> > each use of an extension function no matter how smart a system is.  I'm not
> > convinced any metadata exists that could define the right approach in a
> > consistent number of cases.  This means our choice is whether we force
> > users to explicitly declare each such decision or we just trust that they
> > are doing the proper consideration when they design their plan.  I'm not
> > sure there is a right answer.  One can point to the vast diversity of ways
> > that programming languages have approached implicit vs explicit integer
> > casts.
> >
> > My last concern is that we rely on compute functions in operators other
> > than project/filter.  For example, to use a 

Re: [DISCUSS] Semantics of extension types

2023-12-14 Thread Jin Shang
I'm in favor of Antoine's proposal of storage equivalence traits[1]. For
the sake of clarity I'll paste it here:

I would suggest we perhaps need a more general semantic description of
> storage type equivalence.
> Draft:
> class ExtensionType {
> public:
> // Storage equivalence for equality testing and hashing
> static constexpr uint32_t kEquality = 1;
> // Storage equivalence for ordered comparisons
> static constexpr uint32_t kOrdering = 2;
> // Storage equivalence for selections (filter, take, etc.)
> static constexpr uint32_t kSelection = 4;
> // Storage equivalence for arithmetic
> static constexpr uint32_t kArithmetic = 8;
> // Storage equivalence for explicit casts
> static constexpr uint32_t kCasting = 16;
> // Storage equivalence for all operations
> static constexpr uint32_t kAny = std::numeric_limits::max();
> // By default, an extension type can be implicitly handled as its storage
> type
> // for selections, equality testing and hashing.
> virtual uint32_t storage_equivalence() const { return kEquality |
> kSelection; }
>

I think this is well balanced between convenience and safety. The default
option ensures the "normal" operations like take, group-by, unique... just
work, and extension type authors can easily opt into additional functions.

It also requires minimum engineering efforts. Each function only needs to
specify what traits it requires, rather than the actual types.

BTW I've checked every C++ compute function and I think the only traits
missing here are one for string operations, and one for generation such as
`random`.

[1]  https://github.com/apache/arrow/pull/39200#issuecomment-1852307954

Best,
Jin

On Thu, Dec 14, 2023 at 10:04 PM Weston Pace  wrote:

> I agree engines can use their own strategy.  Requiring explicit casts is
> probably ok as long as it is well documented but I think I lean slightly
> towards implicitly falling back to the storage type.  I do think think
> people still shy away from extension types.  Adding the extension type to
> an implicit cast registry is another hurdle to their use, albeit a small
> one.
>
> Substrait has a similar consideration for extension types.  They can be
> declared "inherits" (meaning the storage type can be used implicitly in
> compute functions) or "separate" (meaning the storage type cannot be used
> implicitly in compute functions).  This would map nicely to an Arrow
> metadata field.
>
> Unfortunately, I think the truth is more nuanced than a simple
> separate/inherits flag.  Take UUID for example (everyone's favorite fixed
> size binary extension type).  We would definitely want to implicitly reuse
> the hash, equality, and sorting functions.
>
> However, for other functions it gets trickier.  Imagine you have a
> `replace_slice` function.  Should it return a new UUID (change some bytes
> in a UUID and you have a new UUID) or not (once you start changing bytes in
> a UUID you no longer have a UUID).  Or what if there was a `slice`
> function.  This function should either be prohibited (you can't slice a
> UUID) or should return a fixed size binary string (you can still slice it
> but you no longer have a UUID).
>
> Given the complication I think users will always need to carefully consider
> each use of an extension function no matter how smart a system is.  I'm not
> convinced any metadata exists that could define the right approach in a
> consistent number of cases.  This means our choice is whether we force
> users to explicitly declare each such decision or we just trust that they
> are doing the proper consideration when they design their plan.  I'm not
> sure there is a right answer.  One can point to the vast diversity of ways
> that programming languages have approached implicit vs explicit integer
> casts.
>
> My last concern is that we rely on compute functions in operators other
> than project/filter.  For example, to use a column as a key for a hash-join
> we need to be able to compute the hash value and calculate equality.  To
> use a column as a key for sorting we need an ordering function.  These are
> places where it might be unexpected for users to insert explicit casts.  An
> engine would need to make sure the error message in these cases was very
> clear.
>
> On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou 
> wrote:
>
> >
> > Hi,
> >
> > For now, I would suggest that each implementation decides on their own
> > strategy, because we don't have a clear idea of which is better (and
> > extension types are probably not getting a lot of use yet).
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit :
> > > The main problem I see with adding properties to ExtensionType is I'm
> not
> > > sure where that information would reside. Allowing type authors to
> > declare
> > > in which ways the type is equivalent (or not) to its storage is
> > appealing,
> > > but it seems to need an official extension field like
> > > ARROW:extension:semantics. Otherwise I think each 

Re: [DISCUSS] Semantics of extension types

2023-12-14 Thread Weston Pace
I agree engines can use their own strategy.  Requiring explicit casts is
probably ok as long as it is well documented but I think I lean slightly
towards implicitly falling back to the storage type.  I do think think
people still shy away from extension types.  Adding the extension type to
an implicit cast registry is another hurdle to their use, albeit a small
one.

Substrait has a similar consideration for extension types.  They can be
declared "inherits" (meaning the storage type can be used implicitly in
compute functions) or "separate" (meaning the storage type cannot be used
implicitly in compute functions).  This would map nicely to an Arrow
metadata field.

Unfortunately, I think the truth is more nuanced than a simple
separate/inherits flag.  Take UUID for example (everyone's favorite fixed
size binary extension type).  We would definitely want to implicitly reuse
the hash, equality, and sorting functions.

However, for other functions it gets trickier.  Imagine you have a
`replace_slice` function.  Should it return a new UUID (change some bytes
in a UUID and you have a new UUID) or not (once you start changing bytes in
a UUID you no longer have a UUID).  Or what if there was a `slice`
function.  This function should either be prohibited (you can't slice a
UUID) or should return a fixed size binary string (you can still slice it
but you no longer have a UUID).

Given the complication I think users will always need to carefully consider
each use of an extension function no matter how smart a system is.  I'm not
convinced any metadata exists that could define the right approach in a
consistent number of cases.  This means our choice is whether we force
users to explicitly declare each such decision or we just trust that they
are doing the proper consideration when they design their plan.  I'm not
sure there is a right answer.  One can point to the vast diversity of ways
that programming languages have approached implicit vs explicit integer
casts.

My last concern is that we rely on compute functions in operators other
than project/filter.  For example, to use a column as a key for a hash-join
we need to be able to compute the hash value and calculate equality.  To
use a column as a key for sorting we need an ordering function.  These are
places where it might be unexpected for users to insert explicit casts.  An
engine would need to make sure the error message in these cases was very
clear.

On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou  wrote:

>
> Hi,
>
> For now, I would suggest that each implementation decides on their own
> strategy, because we don't have a clear idea of which is better (and
> extension types are probably not getting a lot of use yet).
>
> Regards
>
> Antoine.
>
>
> Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit :
> > The main problem I see with adding properties to ExtensionType is I'm not
> > sure where that information would reside. Allowing type authors to
> declare
> > in which ways the type is equivalent (or not) to its storage is
> appealing,
> > but it seems to need an official extension field like
> > ARROW:extension:semantics. Otherwise I think each extension type's
> > semantics would need to be maintained within every implementation as well
> > as in a central reference (probably in Columnar.rst), which seems
> > unreasonable to expect of extension type authors. I'm also skeptical that
> > useful information could be packed into an ARROW:extension:semantics
> field;
> > even if the type can declare that ordering-as-with-storage is invalid I
> > don't think it'd be feasible to specify the correct ordering.
> >
> > If we cannot attach this information to extension types, the question
> > becomes which defaults are most reasonable for engines and how can the
> > engine most usefully be configured outside those defaults. My own
> > preference would be to refuse operations other than selection or
> > casting-to-storage, with a runtime extensible registry of allowed
> implicit
> > casts. This will allow users of the engine to configure their extension
> > types as they need, and the error message raised when an implicit
> > cast-to-storage is not allowed could include the suggestion to register
> the
> > implicit cast. For applications built against a specific engine, this
> > approach would allow recovering much of the advantage of attaching
> > properties to an ExtensionType by including registration of implicit
> casts
> > in the ExtensionType's initialization.
> >
> > On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman 
> > wrote:
> >
> >> Hello all,
> >>
> >> Recently, a PR to arrow c++ [1] was opened to allow implicit casting
> from
> >> any extension type to its storage type in acero. This raises questions
> >> about the validity of applying operations to an extension array's
> storage.
> >> For example, some extension type authors may intend different ordering
> for
> >> arrays of their new type than would be applied to the array's storage or
> >> may not 

Re: [DISCUSS] Semantics of extension types

2023-12-13 Thread Dewey Dunnington
Thank you for opening the discussion here and opening it up!

I agree that attaching semantics as metadata and/or documenting them
in a central repository is an unreasonable burden to put on extension
type authors and Arrow implementations in general.

I also agree that operations other than filter/take/concatenate should
error by default: just because a storage type happens to be an
integer, it doesn't necessarily mean that arithmetic (for example) is
meaningful. (For example, an extension type implementing a bitpacked
uint64 such as an S2 cell or H3 index would result in an invalid value
for "plus one" or "times three").

For query engines and/or implementations with extensive compute
capability like Arrow C++, it is useful to be able to leverage those
for extension types: for the S2/H3 index example, it would be very
cool to allow a group_by + aggregate to "just work" (since ==/hash
*is* valid for this example), although I don't imagine it's a
development priority for anybody right now. I agree with Antoine that
implementations should be able to choose how/if extension type authors
can leverage other capabilities of the engine.

If this is pursued further, it might be worth checking out a
particularly successful extensible vector system implemented in R via
the vctrs package ( https://vctrs.r-lib.org/ ). "vector" class authors
can implement one or more S3 methods (i.e., traits):

- vec_proxy(x) (get me the storage array)
- vec_ptype2(type1, type2) (given two types, get me a type that I can
cast both to or error)
- vec_cast(x, type) (perform a lossless cast to type or error)
- vec_proxy_equal(x) (get me storage array where == does the right thing)
- vec_proxy_order(x) (get me a storage array that sorts in the correct order)
- vec_math(op, x) (perform unary math ops like sum)
- vec_arith(op, lhs, rhs) (perform binary math ops like +, -, etc.)

Cheers!

-dewey

On Wed, Dec 13, 2023 at 12:39 PM Benjamin Kietzman  wrote:
>
> The main problem I see with adding properties to ExtensionType is I'm not
> sure where that information would reside. Allowing type authors to declare
> in which ways the type is equivalent (or not) to its storage is appealing,
> but it seems to need an official extension field like
> ARROW:extension:semantics. Otherwise I think each extension type's
> semantics would need to be maintained within every implementation as well
> as in a central reference (probably in Columnar.rst), which seems
> unreasonable to expect of extension type authors. I'm also skeptical that
> useful information could be packed into an ARROW:extension:semantics field;
> even if the type can declare that ordering-as-with-storage is invalid I
> don't think it'd be feasible to specify the correct ordering.
>
> If we cannot attach this information to extension types, the question
> becomes which defaults are most reasonable for engines and how can the
> engine most usefully be configured outside those defaults. My own
> preference would be to refuse operations other than selection or
> casting-to-storage, with a runtime extensible registry of allowed implicit
> casts. This will allow users of the engine to configure their extension
> types as they need, and the error message raised when an implicit
> cast-to-storage is not allowed could include the suggestion to register the
> implicit cast. For applications built against a specific engine, this
> approach would allow recovering much of the advantage of attaching
> properties to an ExtensionType by including registration of implicit casts
> in the ExtensionType's initialization.
>
> On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman 
> wrote:
>
> > Hello all,
> >
> > Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
> > any extension type to its storage type in acero. This raises questions
> > about the validity of applying operations to an extension array's storage.
> > For example, some extension type authors may intend different ordering for
> > arrays of their new type than would be applied to the array's storage or
> > may not intend for the type to participate in arithmetic even though its
> > storage could.
> >
> > Suggestions/observations from discussion on that PR included:
> > - Extension types could provide general semantic description of storage
> > type equivalence [2], so that a flag on the extension type enables ordering
> > by storage but disables arithmetic on it
> > - Compute functions or kernels could be augmented with a filter declaring
> > which extension types are supported [3].
> > - Currently arrow-rs considers extension types metadata only [4], so all
> > kernels treat extension arrays equivalently to their storage.
> > - Currently arrow c++ only supports explicitly casting from an extension
> > type to its storage (and from storage to ext), so any operation can be
> > performed on an extension array's storage but it requires opting in.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1] 

Re: [DISCUSS] Semantics of extension types

2023-12-13 Thread Antoine Pitrou



Hi,

For now, I would suggest that each implementation decides on their own 
strategy, because we don't have a clear idea of which is better (and 
extension types are probably not getting a lot of use yet).


Regards

Antoine.


Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit :

The main problem I see with adding properties to ExtensionType is I'm not
sure where that information would reside. Allowing type authors to declare
in which ways the type is equivalent (or not) to its storage is appealing,
but it seems to need an official extension field like
ARROW:extension:semantics. Otherwise I think each extension type's
semantics would need to be maintained within every implementation as well
as in a central reference (probably in Columnar.rst), which seems
unreasonable to expect of extension type authors. I'm also skeptical that
useful information could be packed into an ARROW:extension:semantics field;
even if the type can declare that ordering-as-with-storage is invalid I
don't think it'd be feasible to specify the correct ordering.

If we cannot attach this information to extension types, the question
becomes which defaults are most reasonable for engines and how can the
engine most usefully be configured outside those defaults. My own
preference would be to refuse operations other than selection or
casting-to-storage, with a runtime extensible registry of allowed implicit
casts. This will allow users of the engine to configure their extension
types as they need, and the error message raised when an implicit
cast-to-storage is not allowed could include the suggestion to register the
implicit cast. For applications built against a specific engine, this
approach would allow recovering much of the advantage of attaching
properties to an ExtensionType by including registration of implicit casts
in the ExtensionType's initialization.

On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman 
wrote:


Hello all,

Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
any extension type to its storage type in acero. This raises questions
about the validity of applying operations to an extension array's storage.
For example, some extension type authors may intend different ordering for
arrays of their new type than would be applied to the array's storage or
may not intend for the type to participate in arithmetic even though its
storage could.

Suggestions/observations from discussion on that PR included:
- Extension types could provide general semantic description of storage
type equivalence [2], so that a flag on the extension type enables ordering
by storage but disables arithmetic on it
- Compute functions or kernels could be augmented with a filter declaring
which extension types are supported [3].
- Currently arrow-rs considers extension types metadata only [4], so all
kernels treat extension arrays equivalently to their storage.
- Currently arrow c++ only supports explicitly casting from an extension
type to its storage (and from storage to ext), so any operation can be
performed on an extension array's storage but it requires opting in.

Sincerely,
Ben Kietzman

[1] https://github.com/apache/arrow/pull/39200
[2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
[3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
[4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651





Re: [DISCUSS] Semantics of extension types

2023-12-13 Thread Benjamin Kietzman
The main problem I see with adding properties to ExtensionType is I'm not
sure where that information would reside. Allowing type authors to declare
in which ways the type is equivalent (or not) to its storage is appealing,
but it seems to need an official extension field like
ARROW:extension:semantics. Otherwise I think each extension type's
semantics would need to be maintained within every implementation as well
as in a central reference (probably in Columnar.rst), which seems
unreasonable to expect of extension type authors. I'm also skeptical that
useful information could be packed into an ARROW:extension:semantics field;
even if the type can declare that ordering-as-with-storage is invalid I
don't think it'd be feasible to specify the correct ordering.

If we cannot attach this information to extension types, the question
becomes which defaults are most reasonable for engines and how can the
engine most usefully be configured outside those defaults. My own
preference would be to refuse operations other than selection or
casting-to-storage, with a runtime extensible registry of allowed implicit
casts. This will allow users of the engine to configure their extension
types as they need, and the error message raised when an implicit
cast-to-storage is not allowed could include the suggestion to register the
implicit cast. For applications built against a specific engine, this
approach would allow recovering much of the advantage of attaching
properties to an ExtensionType by including registration of implicit casts
in the ExtensionType's initialization.

On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman 
wrote:

> Hello all,
>
> Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
> any extension type to its storage type in acero. This raises questions
> about the validity of applying operations to an extension array's storage.
> For example, some extension type authors may intend different ordering for
> arrays of their new type than would be applied to the array's storage or
> may not intend for the type to participate in arithmetic even though its
> storage could.
>
> Suggestions/observations from discussion on that PR included:
> - Extension types could provide general semantic description of storage
> type equivalence [2], so that a flag on the extension type enables ordering
> by storage but disables arithmetic on it
> - Compute functions or kernels could be augmented with a filter declaring
> which extension types are supported [3].
> - Currently arrow-rs considers extension types metadata only [4], so all
> kernels treat extension arrays equivalently to their storage.
> - Currently arrow c++ only supports explicitly casting from an extension
> type to its storage (and from storage to ext), so any operation can be
> performed on an extension array's storage but it requires opting in.
>
> Sincerely,
> Ben Kietzman
>
> [1] https://github.com/apache/arrow/pull/39200
> [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
> [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
> [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
>