Re: [DISCUSS] Semantics of extension types
I also like these equivalence traits...in addition to being easy for extension type authors to specify when registering an extension type in Arrow C++, implementations that allow registration like pyarrow and arrow/R would be able to specify them easily, whereas implementing methods, compute functions, or overloads to handle it (e.g., like is done in vctrs with vec_proxy_equal, which often just returns its input) would have performance implications (since the methods might have to be defined in R or Python). It may also be worth adding a compute function for "force storage" (a no-op for anything except an extension array), which is maybe safer than a cast (which implies, I think, some logical equivalence between the input and the result). That would let a user work around a situation where the extension type author didn't handle a case that the user expected to work. Cheers! -dewey On Fri, Dec 15, 2023 at 3:13 AM Jin Shang wrote: > > I'm in favor of Antoine's proposal of storage equivalence traits[1]. For > the sake of clarity I'll paste it here: > > I would suggest we perhaps need a more general semantic description of > > storage type equivalence. > > Draft: > > class ExtensionType { > > public: > > // Storage equivalence for equality testing and hashing > > static constexpr uint32_t kEquality = 1; > > // Storage equivalence for ordered comparisons > > static constexpr uint32_t kOrdering = 2; > > // Storage equivalence for selections (filter, take, etc.) > > static constexpr uint32_t kSelection = 4; > > // Storage equivalence for arithmetic > > static constexpr uint32_t kArithmetic = 8; > > // Storage equivalence for explicit casts > > static constexpr uint32_t kCasting = 16; > > // Storage equivalence for all operations > > static constexpr uint32_t kAny = std::numeric_limits::max(); > > // By default, an extension type can be implicitly handled as its storage > > type > > // for selections, equality testing and hashing. > > virtual uint32_t storage_equivalence() const { return kEquality | > > kSelection; } > > > > I think this is well balanced between convenience and safety. The default > option ensures the "normal" operations like take, group-by, unique... just > work, and extension type authors can easily opt into additional functions. > > It also requires minimum engineering efforts. Each function only needs to > specify what traits it requires, rather than the actual types. > > BTW I've checked every C++ compute function and I think the only traits > missing here are one for string operations, and one for generation such as > `random`. > > [1] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954 > > Best, > Jin > > On Thu, Dec 14, 2023 at 10:04 PM Weston Pace wrote: > > > I agree engines can use their own strategy. Requiring explicit casts is > > probably ok as long as it is well documented but I think I lean slightly > > towards implicitly falling back to the storage type. I do think think > > people still shy away from extension types. Adding the extension type to > > an implicit cast registry is another hurdle to their use, albeit a small > > one. > > > > Substrait has a similar consideration for extension types. They can be > > declared "inherits" (meaning the storage type can be used implicitly in > > compute functions) or "separate" (meaning the storage type cannot be used > > implicitly in compute functions). This would map nicely to an Arrow > > metadata field. > > > > Unfortunately, I think the truth is more nuanced than a simple > > separate/inherits flag. Take UUID for example (everyone's favorite fixed > > size binary extension type). We would definitely want to implicitly reuse > > the hash, equality, and sorting functions. > > > > However, for other functions it gets trickier. Imagine you have a > > `replace_slice` function. Should it return a new UUID (change some bytes > > in a UUID and you have a new UUID) or not (once you start changing bytes in > > a UUID you no longer have a UUID). Or what if there was a `slice` > > function. This function should either be prohibited (you can't slice a > > UUID) or should return a fixed size binary string (you can still slice it > > but you no longer have a UUID). > > > > Given the complication I think users will always need to carefully consider > > each use of an extension function no matter how smart a system is. I'm not > > convinced any metadata exists that could define the right approach in a > > consistent number of cases. This means our choice is whether we force > > users to explicitly declare each such decision or we just trust that they > > are doing the proper consideration when they design their plan. I'm not > > sure there is a right answer. One can point to the vast diversity of ways > > that programming languages have approached implicit vs explicit integer > > casts. > > > > My last concern is that we rely on compute functions in operators other > > than project/filter. For example, to use a
Re: [DISCUSS] Semantics of extension types
I'm in favor of Antoine's proposal of storage equivalence traits[1]. For the sake of clarity I'll paste it here: I would suggest we perhaps need a more general semantic description of > storage type equivalence. > Draft: > class ExtensionType { > public: > // Storage equivalence for equality testing and hashing > static constexpr uint32_t kEquality = 1; > // Storage equivalence for ordered comparisons > static constexpr uint32_t kOrdering = 2; > // Storage equivalence for selections (filter, take, etc.) > static constexpr uint32_t kSelection = 4; > // Storage equivalence for arithmetic > static constexpr uint32_t kArithmetic = 8; > // Storage equivalence for explicit casts > static constexpr uint32_t kCasting = 16; > // Storage equivalence for all operations > static constexpr uint32_t kAny = std::numeric_limits::max(); > // By default, an extension type can be implicitly handled as its storage > type > // for selections, equality testing and hashing. > virtual uint32_t storage_equivalence() const { return kEquality | > kSelection; } > I think this is well balanced between convenience and safety. The default option ensures the "normal" operations like take, group-by, unique... just work, and extension type authors can easily opt into additional functions. It also requires minimum engineering efforts. Each function only needs to specify what traits it requires, rather than the actual types. BTW I've checked every C++ compute function and I think the only traits missing here are one for string operations, and one for generation such as `random`. [1] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954 Best, Jin On Thu, Dec 14, 2023 at 10:04 PM Weston Pace wrote: > I agree engines can use their own strategy. Requiring explicit casts is > probably ok as long as it is well documented but I think I lean slightly > towards implicitly falling back to the storage type. I do think think > people still shy away from extension types. Adding the extension type to > an implicit cast registry is another hurdle to their use, albeit a small > one. > > Substrait has a similar consideration for extension types. They can be > declared "inherits" (meaning the storage type can be used implicitly in > compute functions) or "separate" (meaning the storage type cannot be used > implicitly in compute functions). This would map nicely to an Arrow > metadata field. > > Unfortunately, I think the truth is more nuanced than a simple > separate/inherits flag. Take UUID for example (everyone's favorite fixed > size binary extension type). We would definitely want to implicitly reuse > the hash, equality, and sorting functions. > > However, for other functions it gets trickier. Imagine you have a > `replace_slice` function. Should it return a new UUID (change some bytes > in a UUID and you have a new UUID) or not (once you start changing bytes in > a UUID you no longer have a UUID). Or what if there was a `slice` > function. This function should either be prohibited (you can't slice a > UUID) or should return a fixed size binary string (you can still slice it > but you no longer have a UUID). > > Given the complication I think users will always need to carefully consider > each use of an extension function no matter how smart a system is. I'm not > convinced any metadata exists that could define the right approach in a > consistent number of cases. This means our choice is whether we force > users to explicitly declare each such decision or we just trust that they > are doing the proper consideration when they design their plan. I'm not > sure there is a right answer. One can point to the vast diversity of ways > that programming languages have approached implicit vs explicit integer > casts. > > My last concern is that we rely on compute functions in operators other > than project/filter. For example, to use a column as a key for a hash-join > we need to be able to compute the hash value and calculate equality. To > use a column as a key for sorting we need an ordering function. These are > places where it might be unexpected for users to insert explicit casts. An > engine would need to make sure the error message in these cases was very > clear. > > On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou > wrote: > > > > > Hi, > > > > For now, I would suggest that each implementation decides on their own > > strategy, because we don't have a clear idea of which is better (and > > extension types are probably not getting a lot of use yet). > > > > Regards > > > > Antoine. > > > > > > Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit : > > > The main problem I see with adding properties to ExtensionType is I'm > not > > > sure where that information would reside. Allowing type authors to > > declare > > > in which ways the type is equivalent (or not) to its storage is > > appealing, > > > but it seems to need an official extension field like > > > ARROW:extension:semantics. Otherwise I think each
Re: [DISCUSS] Semantics of extension types
I agree engines can use their own strategy. Requiring explicit casts is probably ok as long as it is well documented but I think I lean slightly towards implicitly falling back to the storage type. I do think think people still shy away from extension types. Adding the extension type to an implicit cast registry is another hurdle to their use, albeit a small one. Substrait has a similar consideration for extension types. They can be declared "inherits" (meaning the storage type can be used implicitly in compute functions) or "separate" (meaning the storage type cannot be used implicitly in compute functions). This would map nicely to an Arrow metadata field. Unfortunately, I think the truth is more nuanced than a simple separate/inherits flag. Take UUID for example (everyone's favorite fixed size binary extension type). We would definitely want to implicitly reuse the hash, equality, and sorting functions. However, for other functions it gets trickier. Imagine you have a `replace_slice` function. Should it return a new UUID (change some bytes in a UUID and you have a new UUID) or not (once you start changing bytes in a UUID you no longer have a UUID). Or what if there was a `slice` function. This function should either be prohibited (you can't slice a UUID) or should return a fixed size binary string (you can still slice it but you no longer have a UUID). Given the complication I think users will always need to carefully consider each use of an extension function no matter how smart a system is. I'm not convinced any metadata exists that could define the right approach in a consistent number of cases. This means our choice is whether we force users to explicitly declare each such decision or we just trust that they are doing the proper consideration when they design their plan. I'm not sure there is a right answer. One can point to the vast diversity of ways that programming languages have approached implicit vs explicit integer casts. My last concern is that we rely on compute functions in operators other than project/filter. For example, to use a column as a key for a hash-join we need to be able to compute the hash value and calculate equality. To use a column as a key for sorting we need an ordering function. These are places where it might be unexpected for users to insert explicit casts. An engine would need to make sure the error message in these cases was very clear. On Wed, Dec 13, 2023 at 12:54 PM Antoine Pitrou wrote: > > Hi, > > For now, I would suggest that each implementation decides on their own > strategy, because we don't have a clear idea of which is better (and > extension types are probably not getting a lot of use yet). > > Regards > > Antoine. > > > Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit : > > The main problem I see with adding properties to ExtensionType is I'm not > > sure where that information would reside. Allowing type authors to > declare > > in which ways the type is equivalent (or not) to its storage is > appealing, > > but it seems to need an official extension field like > > ARROW:extension:semantics. Otherwise I think each extension type's > > semantics would need to be maintained within every implementation as well > > as in a central reference (probably in Columnar.rst), which seems > > unreasonable to expect of extension type authors. I'm also skeptical that > > useful information could be packed into an ARROW:extension:semantics > field; > > even if the type can declare that ordering-as-with-storage is invalid I > > don't think it'd be feasible to specify the correct ordering. > > > > If we cannot attach this information to extension types, the question > > becomes which defaults are most reasonable for engines and how can the > > engine most usefully be configured outside those defaults. My own > > preference would be to refuse operations other than selection or > > casting-to-storage, with a runtime extensible registry of allowed > implicit > > casts. This will allow users of the engine to configure their extension > > types as they need, and the error message raised when an implicit > > cast-to-storage is not allowed could include the suggestion to register > the > > implicit cast. For applications built against a specific engine, this > > approach would allow recovering much of the advantage of attaching > > properties to an ExtensionType by including registration of implicit > casts > > in the ExtensionType's initialization. > > > > On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman > > wrote: > > > >> Hello all, > >> > >> Recently, a PR to arrow c++ [1] was opened to allow implicit casting > from > >> any extension type to its storage type in acero. This raises questions > >> about the validity of applying operations to an extension array's > storage. > >> For example, some extension type authors may intend different ordering > for > >> arrays of their new type than would be applied to the array's storage or > >> may not
Re: [DISCUSS] Semantics of extension types
Thank you for opening the discussion here and opening it up! I agree that attaching semantics as metadata and/or documenting them in a central repository is an unreasonable burden to put on extension type authors and Arrow implementations in general. I also agree that operations other than filter/take/concatenate should error by default: just because a storage type happens to be an integer, it doesn't necessarily mean that arithmetic (for example) is meaningful. (For example, an extension type implementing a bitpacked uint64 such as an S2 cell or H3 index would result in an invalid value for "plus one" or "times three"). For query engines and/or implementations with extensive compute capability like Arrow C++, it is useful to be able to leverage those for extension types: for the S2/H3 index example, it would be very cool to allow a group_by + aggregate to "just work" (since ==/hash *is* valid for this example), although I don't imagine it's a development priority for anybody right now. I agree with Antoine that implementations should be able to choose how/if extension type authors can leverage other capabilities of the engine. If this is pursued further, it might be worth checking out a particularly successful extensible vector system implemented in R via the vctrs package ( https://vctrs.r-lib.org/ ). "vector" class authors can implement one or more S3 methods (i.e., traits): - vec_proxy(x) (get me the storage array) - vec_ptype2(type1, type2) (given two types, get me a type that I can cast both to or error) - vec_cast(x, type) (perform a lossless cast to type or error) - vec_proxy_equal(x) (get me storage array where == does the right thing) - vec_proxy_order(x) (get me a storage array that sorts in the correct order) - vec_math(op, x) (perform unary math ops like sum) - vec_arith(op, lhs, rhs) (perform binary math ops like +, -, etc.) Cheers! -dewey On Wed, Dec 13, 2023 at 12:39 PM Benjamin Kietzman wrote: > > The main problem I see with adding properties to ExtensionType is I'm not > sure where that information would reside. Allowing type authors to declare > in which ways the type is equivalent (or not) to its storage is appealing, > but it seems to need an official extension field like > ARROW:extension:semantics. Otherwise I think each extension type's > semantics would need to be maintained within every implementation as well > as in a central reference (probably in Columnar.rst), which seems > unreasonable to expect of extension type authors. I'm also skeptical that > useful information could be packed into an ARROW:extension:semantics field; > even if the type can declare that ordering-as-with-storage is invalid I > don't think it'd be feasible to specify the correct ordering. > > If we cannot attach this information to extension types, the question > becomes which defaults are most reasonable for engines and how can the > engine most usefully be configured outside those defaults. My own > preference would be to refuse operations other than selection or > casting-to-storage, with a runtime extensible registry of allowed implicit > casts. This will allow users of the engine to configure their extension > types as they need, and the error message raised when an implicit > cast-to-storage is not allowed could include the suggestion to register the > implicit cast. For applications built against a specific engine, this > approach would allow recovering much of the advantage of attaching > properties to an ExtensionType by including registration of implicit casts > in the ExtensionType's initialization. > > On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman > wrote: > > > Hello all, > > > > Recently, a PR to arrow c++ [1] was opened to allow implicit casting from > > any extension type to its storage type in acero. This raises questions > > about the validity of applying operations to an extension array's storage. > > For example, some extension type authors may intend different ordering for > > arrays of their new type than would be applied to the array's storage or > > may not intend for the type to participate in arithmetic even though its > > storage could. > > > > Suggestions/observations from discussion on that PR included: > > - Extension types could provide general semantic description of storage > > type equivalence [2], so that a flag on the extension type enables ordering > > by storage but disables arithmetic on it > > - Compute functions or kernels could be augmented with a filter declaring > > which extension types are supported [3]. > > - Currently arrow-rs considers extension types metadata only [4], so all > > kernels treat extension arrays equivalently to their storage. > > - Currently arrow c++ only supports explicitly casting from an extension > > type to its storage (and from storage to ext), so any operation can be > > performed on an extension array's storage but it requires opting in. > > > > Sincerely, > > Ben Kietzman > > > > [1]
Re: [DISCUSS] Semantics of extension types
Hi, For now, I would suggest that each implementation decides on their own strategy, because we don't have a clear idea of which is better (and extension types are probably not getting a lot of use yet). Regards Antoine. Le 13/12/2023 à 17:39, Benjamin Kietzman a écrit : The main problem I see with adding properties to ExtensionType is I'm not sure where that information would reside. Allowing type authors to declare in which ways the type is equivalent (or not) to its storage is appealing, but it seems to need an official extension field like ARROW:extension:semantics. Otherwise I think each extension type's semantics would need to be maintained within every implementation as well as in a central reference (probably in Columnar.rst), which seems unreasonable to expect of extension type authors. I'm also skeptical that useful information could be packed into an ARROW:extension:semantics field; even if the type can declare that ordering-as-with-storage is invalid I don't think it'd be feasible to specify the correct ordering. If we cannot attach this information to extension types, the question becomes which defaults are most reasonable for engines and how can the engine most usefully be configured outside those defaults. My own preference would be to refuse operations other than selection or casting-to-storage, with a runtime extensible registry of allowed implicit casts. This will allow users of the engine to configure their extension types as they need, and the error message raised when an implicit cast-to-storage is not allowed could include the suggestion to register the implicit cast. For applications built against a specific engine, this approach would allow recovering much of the advantage of attaching properties to an ExtensionType by including registration of implicit casts in the ExtensionType's initialization. On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman wrote: Hello all, Recently, a PR to arrow c++ [1] was opened to allow implicit casting from any extension type to its storage type in acero. This raises questions about the validity of applying operations to an extension array's storage. For example, some extension type authors may intend different ordering for arrays of their new type than would be applied to the array's storage or may not intend for the type to participate in arithmetic even though its storage could. Suggestions/observations from discussion on that PR included: - Extension types could provide general semantic description of storage type equivalence [2], so that a flag on the extension type enables ordering by storage but disables arithmetic on it - Compute functions or kernels could be augmented with a filter declaring which extension types are supported [3]. - Currently arrow-rs considers extension types metadata only [4], so all kernels treat extension arrays equivalently to their storage. - Currently arrow c++ only supports explicitly casting from an extension type to its storage (and from storage to ext), so any operation can be performed on an extension array's storage but it requires opting in. Sincerely, Ben Kietzman [1] https://github.com/apache/arrow/pull/39200 [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954 [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161 [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
Re: [DISCUSS] Semantics of extension types
The main problem I see with adding properties to ExtensionType is I'm not sure where that information would reside. Allowing type authors to declare in which ways the type is equivalent (or not) to its storage is appealing, but it seems to need an official extension field like ARROW:extension:semantics. Otherwise I think each extension type's semantics would need to be maintained within every implementation as well as in a central reference (probably in Columnar.rst), which seems unreasonable to expect of extension type authors. I'm also skeptical that useful information could be packed into an ARROW:extension:semantics field; even if the type can declare that ordering-as-with-storage is invalid I don't think it'd be feasible to specify the correct ordering. If we cannot attach this information to extension types, the question becomes which defaults are most reasonable for engines and how can the engine most usefully be configured outside those defaults. My own preference would be to refuse operations other than selection or casting-to-storage, with a runtime extensible registry of allowed implicit casts. This will allow users of the engine to configure their extension types as they need, and the error message raised when an implicit cast-to-storage is not allowed could include the suggestion to register the implicit cast. For applications built against a specific engine, this approach would allow recovering much of the advantage of attaching properties to an ExtensionType by including registration of implicit casts in the ExtensionType's initialization. On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman wrote: > Hello all, > > Recently, a PR to arrow c++ [1] was opened to allow implicit casting from > any extension type to its storage type in acero. This raises questions > about the validity of applying operations to an extension array's storage. > For example, some extension type authors may intend different ordering for > arrays of their new type than would be applied to the array's storage or > may not intend for the type to participate in arithmetic even though its > storage could. > > Suggestions/observations from discussion on that PR included: > - Extension types could provide general semantic description of storage > type equivalence [2], so that a flag on the extension type enables ordering > by storage but disables arithmetic on it > - Compute functions or kernels could be augmented with a filter declaring > which extension types are supported [3]. > - Currently arrow-rs considers extension types metadata only [4], so all > kernels treat extension arrays equivalently to their storage. > - Currently arrow c++ only supports explicitly casting from an extension > type to its storage (and from storage to ext), so any operation can be > performed on an extension array's storage but it requires opting in. > > Sincerely, > Ben Kietzman > > [1] https://github.com/apache/arrow/pull/39200 > [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954 > [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161 > [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651 >