tobixdev opened a new issue, #8730:
URL: https://github.com/apache/arrow-rs/issues/8730

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Customizing the behavior of extension types is currently cumbersome for 
users. For example, consider a specialized pretty-printing implementation for 
a certain type (e.g., one that formats JSON values). 
   
   In DataFusion this is currently not implemented. Even though we have started 
to replace `DataType` with `Field`, this still requires us to pass through some 
kind of extension type registry (github.com/apache/datafusion/issues/18223) 
through all code paths that require access to the customized printing 
implementation. The procedure would be to look up the extension type in the 
registry and then call the pretty-printing implementation.
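
To make the registry-based procedure concrete, here is a minimal sketch using only the standard library. `ExtensionTypeRegistry`, `PrettyPrinter`, and `JsonPrinter` are hypothetical names invented for illustration, not arrow-rs or DataFusion APIs, and the field is reduced to its metadata map. The only thing taken from Arrow itself is the `ARROW:extension:name` metadata key.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical customization hook; not an existing arrow-rs trait.
pub trait PrettyPrinter: Send + Sync {
    fn format_value(&self, raw: &str) -> String;
}

/// Hypothetical printer for a JSON extension type.
pub struct JsonPrinter;

impl PrettyPrinter for JsonPrinter {
    fn format_value(&self, raw: &str) -> String {
        format!("json:{}", raw)
    }
}

/// Hypothetical registry that maps extension names to printers.
#[derive(Default)]
pub struct ExtensionTypeRegistry {
    printers: HashMap<String, Arc<dyn PrettyPrinter>>,
}

impl ExtensionTypeRegistry {
    pub fn register(&mut self, name: &str, printer: Arc<dyn PrettyPrinter>) {
        self.printers.insert(name.to_string(), printer);
    }

    pub fn lookup(&self, name: &str) -> Option<Arc<dyn PrettyPrinter>> {
        self.printers.get(name).cloned()
    }
}

/// Every code path that prints a value needs both the field metadata
/// and the registry. Falls back to the raw value if nothing is registered.
pub fn pretty_print(
    registry: &ExtensionTypeRegistry,
    field_metadata: &HashMap<String, String>,
    raw: &str,
) -> String {
    field_metadata
        .get("ARROW:extension:name")
        .and_then(|name| registry.lookup(name))
        .map(|printer| printer.format_value(raw))
        .unwrap_or_else(|| raw.to_string())
}
```

The `registry` parameter on `pretty_print` is the part that has to be threaded through every caller.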
   
   While this is possible, I am currently exploring an approach that directly 
associates a `dyn DynExtensionType` with the `Field`, thus making it possible 
to access the pretty-printing implementation without passing a registry around. 
I think `Field` would be a good candidate for that as it is currently used to 
store the metadata.
   
   Before undertaking any significant implementation effort, I think we should 
have a discussion on how (and if) we want to support such customization options 
in arrow-rs.  
   
   **Describe the solution you'd like**
   
   I think there are two approaches to improve the situation from arrow-rs:
   
   Either we replace the `DataType` in the `Field` with a new `FieldType` enum:
   
   ```rust
   pub enum FieldType {
       Physical(DataType),
       Extension(DataType, Arc<dyn DynExtensionType>)
   }
   ```
   
   or we add an additional `extension_type` field to `Field` with the type 
`Option<Arc<dyn DynExtensionType>>`. 
   
   The `DynExtensionType` would have an `as_any` method that allows users 
(e.g., DataFusion) to cast to their specific extension type traits. If someone 
has a better idea that does not rely on downcasting, feel free to propose it.
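
As a rough illustration of the `as_any` idea combined with the optional `extension_type` field, consider the sketch below. Everything in it is an assumption made for the example: the `DynExtensionType` trait shape, the simplified `Field` struct, and the hypothetical downstream types `PrettyPrint` and `JsonExtension`. Note that `std::any::Any` can only downcast to a concrete type, so the sketch downcasts to `JsonExtension` and then uses the trait it implements.

```rust
use std::any::Any;
use std::fmt::Debug;
use std::sync::Arc;

/// Assumed shape of the trait named in this issue; not the final API.
pub trait DynExtensionType: Debug + Send + Sync {
    fn name(&self) -> &str;
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical downstream trait (e.g., defined by DataFusion).
pub trait PrettyPrint {
    fn pretty(&self, raw: &str) -> String;
}

/// Hypothetical extension type provided by a user.
#[derive(Debug)]
pub struct JsonExtension;

impl DynExtensionType for JsonExtension {
    fn name(&self) -> &str {
        "arrow.json"
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl PrettyPrint for JsonExtension {
    fn pretty(&self, raw: &str) -> String {
        format!("json:{}", raw)
    }
}

/// Simplified stand-in for `Field` with the additional optional field.
pub struct Field {
    pub name: String,
    pub extension_type: Option<Arc<dyn DynExtensionType>>,
}

/// No registry parameter: the extension type travels with the field.
pub fn pretty_print(field: &Field, raw: &str) -> String {
    let json = field
        .extension_type
        .as_ref()
        .and_then(|ext| ext.as_any().downcast_ref::<JsonExtension>());
    match json {
        Some(ext) => ext.pretty(raw),
        None => raw.to_string(),
    }
}
```

The downcast still requires the caller to know which concrete types to try, which is the drawback of relying on downcasting.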
   
   I've whipped together a rough prototype of what this could look like (the 
API is not really changed yet): 
   
https://github.com/apache/arrow-rs/compare/main...tobixdev:arrow-rs:crazy-field-experiment?expand=1
   
   Personally, I'd prefer the first solution, but it's a bigger breaking 
change. It could be enough to provide a `storage_type()` method that returns 
the `DataType` as it is in the current version of arrow.
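
A minimal sketch of how `storage_type()` could sit on the `FieldType` enum is shown below. It is not taken from the prototype branch: `DataType` is a two-variant stand-in for the real arrow enum, and the `DynExtensionType` trait shape, the `extension_type()` accessor, and `JsonExtension` are assumptions made for illustration.

```rust
use std::fmt::Debug;
use std::sync::Arc;

/// Two-variant stand-in for arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    Int64,
}

/// Assumed shape of the trait named in this issue; not the final API.
pub trait DynExtensionType: Debug + Send + Sync {
    fn name(&self) -> &str;
}

/// Hypothetical extension type stored as UTF-8 strings.
#[derive(Debug)]
pub struct JsonExtension;

impl DynExtensionType for JsonExtension {
    fn name(&self) -> &str {
        "arrow.json"
    }
}

#[derive(Debug, Clone)]
pub enum FieldType {
    Physical(DataType),
    Extension(DataType, Arc<dyn DynExtensionType>),
}

impl FieldType {
    /// Returns the physical type for both variants, so code that only
    /// cares about the data layout can ignore the extension.
    pub fn storage_type(&self) -> &DataType {
        match self {
            FieldType::Physical(dt) | FieldType::Extension(dt, _) => dt,
        }
    }

    /// Hypothetical accessor for the attached extension type, if any.
    pub fn extension_type(&self) -> Option<&Arc<dyn DynExtensionType>> {
        match self {
            FieldType::Physical(_) => None,
            FieldType::Extension(_, ext) => Some(ext),
        }
    }
}
```

With this shape, existing callers that match on `DataType` would call `storage_type()` first, while extension-aware code would check `extension_type()`.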
   
   Of course, a registry will still be needed at some point. The pieces of code 
that instantiate new `Field`s (e.g., a parser) would require access to the 
registry. 
   
   **Describe alternatives you've considered**
   
   We can also keep these efforts completely in DataFusion. This would require 
either i) creating something akin to `DataFusionField` or 
`DataFusionExtensionInformation` or ii) passing a registry around and using 
that for looking up the pretty-printing implementation. 
   
   **Additional context**
   
   There has been discussion on using a `DataType::ExtensionType(...)` enum 
variant for the same purpose, but AFAIK we decided against this approach 
because keeping extension types out of `DataType` allows arrow kernels to focus 
on the physical data layout (which makes sense IMO). Still, not needing a 
registry everywhere is an attractive aspect of that solution, and the `Field` 
approach could provide it as well.
   
   Other links:
   - https://github.com/apache/datafusion/issues/18223
   - Pola.rs seems to pursue a [`DataType::Extension` 
variant](https://github.com/geopolars/geopolars/issues/245)

