tobixdev opened a new issue, #8730:
URL: https://github.com/apache/arrow-rs/issues/8730
**Is your feature request related to a problem or challenge? Please describe
what you are trying to do.**
Currently, customizing the behavior of extension types can be cumbersome for
users. One example is a specialized pretty-printing implementation for a
certain type (e.g., formatting JSON).
In DataFusion this is currently not implemented. Even though we have started
to replace `DataType` with `Field`, this still requires us to pass through some
kind of extension type registry (github.com/apache/datafusion/issues/18223)
through all code paths that require access to the customized printing
implementation. The procedure would be to look up the extension type in the
registry and then call the pretty-printing implementation.
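
To make the registry-based procedure concrete, here is a minimal sketch using stand-in types. `Field`, `ExtensionPrinter`, `ExtensionRegistry`, and `UpperPrinter` are hypothetical and are not the arrow-rs or DataFusion APIs; only the `ARROW:extension:name` metadata key is taken from the Arrow format.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Metadata key under which Arrow stores the extension type name.
pub const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";

/// Stand-in for `arrow_schema::Field`, reduced to what this sketch needs.
pub struct Field {
    pub name: String,
    pub metadata: HashMap<String, String>,
}

/// Hypothetical trait for a customized pretty-printing implementation.
pub trait ExtensionPrinter: Send + Sync {
    fn format_value(&self, raw: &str) -> String;
}

/// Hypothetical registry mapping extension names to printers.
#[derive(Default)]
pub struct ExtensionRegistry {
    printers: HashMap<String, Arc<dyn ExtensionPrinter>>,
}

impl ExtensionRegistry {
    pub fn register(&mut self, name: &str, printer: Arc<dyn ExtensionPrinter>) {
        self.printers.insert(name.to_string(), printer);
    }

    /// Look up the printer for a field via its extension name, if any.
    pub fn printer_for(&self, field: &Field) -> Option<&Arc<dyn ExtensionPrinter>> {
        let name = field.metadata.get(EXTENSION_NAME_KEY)?;
        self.printers.get(name)
    }
}

/// Every code path that prints values has to receive the registry as an argument.
pub fn pretty_print(registry: &ExtensionRegistry, field: &Field, raw: &str) -> String {
    match registry.printer_for(field) {
        Some(printer) => printer.format_value(raw),
        None => raw.to_string(),
    }
}

/// Placeholder printer that upper-cases values (standing in for e.g. JSON formatting).
pub struct UpperPrinter;

impl ExtensionPrinter for UpperPrinter {
    fn format_value(&self, raw: &str) -> String {
        raw.to_uppercase()
    }
}
```

The `registry` parameter on `pretty_print` is the part that has to be threaded through all callers.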
While this is possible, I am currently exploring an approach that directly
associates a `dyn DynExtensionType` with the `Field`, thus making it possible
to access the pretty-printing implementation without passing a registry around.
I think `Field` would be a good candidate for that as it is currently used to
store the metadata.
Before undertaking any significant implementation effort, I think we should
have a discussion on how (and if) we want to support such customization options
in arrow-rs.
**Describe the solution you'd like**
I think there are two approaches to improve the situation from the arrow-rs
side. Either we replace the `DataType` in the `Field` with a new `FieldType` enum:
```rust
pub enum FieldType {
    Physical(DataType),
    Extension(DataType, Arc<dyn DynExtensionType>),
}
```
or we add an additional field `extension_type` of type
`Option<Arc<dyn DynExtensionType>>` to `Field`.
The `DynExtensionType` would have an `as_any` method that allows users
(e.g., DataFusion) to cast to their specific extension type traits. If someone
has a better idea that does not rely on downcasting, feel free to propose it.
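
As a rough illustration of the `as_any` idea, the sketch below uses the `extension_type` field variant with stand-in types. `DataType` and `Field` are simplified placeholders for the arrow-rs types, the method set of `DynExtensionType` is an assumption, and `JsonExtension` is a hypothetical downstream type.

```rust
use std::any::Any;
use std::sync::Arc;

/// Stand-in for `arrow_schema::DataType`, reduced to two variants.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    Int64,
}

/// Sketch of the proposed object-safe trait; the method set is an assumption.
pub trait DynExtensionType: Any + Send + Sync {
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}

/// Stand-in for `Field` carrying the optional extension type directly.
pub struct Field {
    pub data_type: DataType,
    pub extension_type: Option<Arc<dyn DynExtensionType>>,
}

/// Hypothetical extension type defined by a downstream crate.
pub struct JsonExtension;

impl JsonExtension {
    pub fn pretty(&self, raw: &str) -> String {
        format!("json: {}", raw)
    }
}

impl DynExtensionType for JsonExtension {
    fn name(&self) -> &str {
        "example.json"
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// No registry argument: the field itself leads to the custom implementation.
pub fn pretty_print(field: &Field, raw: &str) -> String {
    field
        .extension_type
        .as_ref()
        .and_then(|ext| ext.as_any().downcast_ref::<JsonExtension>())
        .map(|json| json.pretty(raw))
        .unwrap_or_else(|| raw.to_string())
}
```

Note that `downcast_ref` recovers a concrete type, so in this sketch the caller has to know `JsonExtension`; reaching a downstream trait object instead would need an additional mechanism.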
I've whipped together a rough prototype of what this could look like (the API
is not really changed yet):
https://github.com/apache/arrow-rs/compare/main...tobixdev:arrow-rs:crazy-field-experiment?expand=1
Personally, I'd prefer the first solution, but it's a bigger breaking change.
It could be enough if we provide a `storage_type()` method that returns the
`DataType` as it exists in the current version of arrow.
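
A `storage_type()` accessor on the enum could be sketched as follows. The types are stand-ins for illustration, and the `extension_type()` helper and `JsonExtension` are assumptions that are not part of the proposal text.

```rust
use std::sync::Arc;

/// Stand-in for `arrow_schema::DataType`, reduced to two variants.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    Int64,
}

/// Minimal placeholder for the proposed trait.
pub trait DynExtensionType: Send + Sync {
    fn name(&self) -> &str;
}

pub enum FieldType {
    Physical(DataType),
    Extension(DataType, Arc<dyn DynExtensionType>),
}

impl FieldType {
    /// Physical layout, regardless of whether an extension type is attached.
    pub fn storage_type(&self) -> &DataType {
        match self {
            FieldType::Physical(dt) | FieldType::Extension(dt, _) => dt,
        }
    }

    /// Hypothetical helper returning the attached extension type, if any.
    pub fn extension_type(&self) -> Option<&Arc<dyn DynExtensionType>> {
        match self {
            FieldType::Physical(_) => None,
            FieldType::Extension(_, ext) => Some(ext),
        }
    }
}

/// Hypothetical extension type used only for this example.
pub struct JsonExtension;

impl DynExtensionType for JsonExtension {
    fn name(&self) -> &str {
        "example.json"
    }
}
```

Code that only cares about the physical layout, such as kernels, could keep calling `storage_type()` and ignore the extension entirely.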
Of course, a registry will still be needed at some point. The pieces of code
that instantiate new `Field`s (e.g., a parser) would require access to the
registry.
**Describe alternatives you've considered**
We can also keep these efforts completely in DataFusion. This would require
either i) creating something akin to `DataFusionField` or
`DataFusionExtensionInformation` or ii) passing a registry around and using it
to look up the pretty-printing implementation.
**Additional context**
There has been discussion on using a `DataType::ExtensionType(...)` enum
variant for the same purpose, but AFAIK we decided against this approach because
keeping extension types out of `DataType` allows arrow kernels to focus on the
physical data layout (which makes sense IMO). Still, not needing a registry
everywhere is an attractive aspect of that solution, and the `Field` approach
could provide it as well.
Other links:
- https://github.com/apache/datafusion/issues/18223
- Pola.rs seems to pursue a [`DataType::Extension`
variant](https://github.com/geopolars/geopolars/issues/245)