The main problem I see with adding properties to ExtensionType is that I'm not sure where that information would reside. Allowing type authors to declare in which ways the type is equivalent (or not) to its storage is appealing, but it seems to need an official extension field like ARROW:extension:semantics. Otherwise, I think each extension type's semantics would need to be maintained within every implementation as well as in a central reference (probably Columnar.rst), which seems unreasonable to expect of extension type authors. I'm also skeptical that enough useful information could be packed into an ARROW:extension:semantics field: even if a type can declare that ordering-as-with-storage is invalid, I don't think it'd be feasible for it to specify the correct ordering.
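To make the limitation concrete, here is a minimal sketch of what such a field might carry. The `ARROW:extension:semantics` key, the JSON payload, the flag names, and the `example.version` type are all hypothetical (only `ARROW:extension:name` and `ARROW:extension:metadata` are existing field-metadata keys), and the refuse-by-default behavior is an assumption.

```python
import json

# Existing Arrow field-metadata keys that mark a field as an extension type.
NAME_KEY = "ARROW:extension:name"
METADATA_KEY = "ARROW:extension:metadata"
# Hypothetical key: not part of the Arrow format, only floated in this thread.
SEMANTICS_KEY = "ARROW:extension:semantics"


def make_field_metadata(name, serialized, semantics):
    """Build field metadata with an assumed JSON-encoded semantics payload."""
    return {
        NAME_KEY: name,
        METADATA_KEY: serialized,
        SEMANTICS_KEY: json.dumps(semantics, sort_keys=True),
    }


def storage_op_allowed(field_metadata, flag):
    """Return True only if the type explicitly declares `flag` as valid.

    A missing field or missing flag is treated as "not allowed"
    (an assumed conservative default).
    """
    raw = field_metadata.get(SEMANTICS_KEY)
    if raw is None:
        return False
    return bool(json.loads(raw).get(flag, False))


# A made-up version-string type stored as utf8: equality matches storage,
# but lexicographic ordering and arithmetic do not apply.
meta = make_field_metadata(
    "example.version",
    "",
    {
        "equality_as_storage": True,
        "ordering_as_storage": False,
        "arithmetic_as_storage": False,
    },
)
```

The flags can say that ordering by storage is wrong for `example.version`, but nothing in a payload like this tells an engine what the right ordering is, which is the gap described above.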
If we cannot attach this information to extension types, the question becomes which defaults are most reasonable for engines, and how an engine can most usefully be configured outside those defaults. My own preference would be to refuse operations other than selection or casting-to-storage, with a runtime-extensible registry of allowed implicit casts. This would allow users of the engine to configure their extension types as they need, and the error message raised when an implicit cast-to-storage is not allowed could include the suggestion to register the implicit cast. For applications built against a specific engine, this approach would recover much of the advantage of attaching properties to an ExtensionType by including registration of implicit casts in the ExtensionType's initialization.

On Wed, Dec 13, 2023 at 10:46 AM Benjamin Kietzman <bengil...@gmail.com> wrote:

> Hello all,
>
> Recently, a PR to arrow c++ [1] was opened to allow implicit casting from
> any extension type to its storage type in acero. This raises questions
> about the validity of applying operations to an extension array's storage.
> For example, some extension type authors may intend different ordering for
> arrays of their new type than would be applied to the array's storage or
> may not intend for the type to participate in arithmetic even though its
> storage could.
>
> Suggestions/observations from discussion on that PR included:
> - Extension types could provide general semantic description of storage
> type equivalence [2], so that a flag on the extension type enables ordering
> by storage but disables arithmetic on it
> - Compute functions or kernels could be augmented with a filter declaring
> which extension types are supported [3].
> - Currently arrow-rs considers extension types metadata only [4], so all
> kernels treat extension arrays equivalently to their storage.
> - Currently arrow c++ only supports explicitly casting from an extension
> type to its storage (and from storage to ext), so any operation can be
> performed on an extension array's storage but it requires opting in.
>
> Sincerely,
> Ben Kietzman
>
> [1] https://github.com/apache/arrow/pull/39200
> [2] https://github.com/apache/arrow/pull/39200#issuecomment-1852307954
> [3] https://github.com/apache/arrow/pull/39200#issuecomment-1852676161
> [4] https://github.com/apache/arrow/pull/39200#issuecomment-1852881651
>
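A minimal sketch of the runtime-extensible registry of allowed implicit casts proposed in the reply above the quoted message. This is not acero's API: the class, method, and exception names are hypothetical, keying the registry by (extension type name, function name) is one possible design assumed here, and the always-allowed function names merely stand in for "selection or casting-to-storage".

```python
# Assumed stand-ins for "selection or casting-to-storage"; these never
# need registration in this sketch.
ALWAYS_ALLOWED = frozenset({"filter", "take", "cast"})


class ImplicitCastNotAllowed(TypeError):
    """Raised when a function would need an unregistered implicit cast."""


class ImplicitCastRegistry:
    """Hypothetical registry of (extension type, function) pairs for which
    an engine may implicitly cast an extension array to its storage."""

    def __init__(self):
        self._allowed = set()

    def register(self, extension_name, function_name):
        """Allow `function_name` to run on the storage of `extension_name`."""
        self._allowed.add((extension_name, function_name))

    def check(self, extension_name, function_name):
        """Return None if the call is allowed, otherwise raise an error
        whose message suggests how to register the cast."""
        if function_name in ALWAYS_ALLOWED:
            return
        if (extension_name, function_name) in self._allowed:
            return
        raise ImplicitCastNotAllowed(
            f"function '{function_name}' is not allowed on extension type "
            f"'{extension_name}'; cast to storage explicitly, or call "
            f"register('{extension_name}', '{function_name}') to allow "
            f"the implicit cast"
        )


registry = ImplicitCastRegistry()

# An application built against a specific engine could do this while
# initializing its extension type (the type name is made up).
registry.register("example.uuid", "equal")
```

With this setup, `registry.check("example.uuid", "equal")` and `registry.check("example.uuid", "take")` pass, while `registry.check("example.uuid", "add")` raises `ImplicitCastNotAllowed` with the registration hint in its message.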