tobixdev opened a new pull request, #18552: URL: https://github.com/apache/datafusion/pull/18552
## Which issue does this PR close?

This is a draft for #18223. The APIs are not to be considered final (e.g., options are missing in the pretty printer). The primary purpose for now is to spark discussion, so I'm happy to hear any input!

## Rationale for this change

How cool would it be to just state that you should properly format my byte-encoded UUIDs? :)

## What changes are included in this PR?

- Defines the `LogicalType` trait for some canonical extension types from arrow.
- Defines `UnresolvedExtensionType`, a "DataFusion canonical extension type" that can be used to create a `LogicalType` instance even without a registry. The creation functions for `DFSchema` could make use of this type, assuming that `DFSchema` should have access to logical types. Furthermore, these functions could directly instantiate the canonical arrow extension types, as they are known to the system. The functions could then resolve native and canonical extension types without access to the registry and "delay" the resolution of custom extension types. The idea is that a "Type Resolver Pass" with access to a registry then replaces all instances of this type with the actual one. While I hope that this is only a temporary solution until all places have access to a logical type registry, I think this has the potential to become a "permanent temporary solution". With this in mind, we could also consider making this explicit with an enum rather than hiding it behind dynamic dispatch.
- Defines an incomplete `ValuePrettyPrinter` for showcasing the UUID pretty printing.
- Plumbing for having `ExtensionTypeRegistry` in `SessionState`.

What is also important is what is *not* included: an integrative example of making use of the pretty printer. I tried several avenues but, as you can imagine, each change to the core data structure is a huge plumbing effort (hopefully reduced by the existence of `UnresolvedLogicalType`).
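To make the "Type Resolver Pass" idea and the enum alternative a bit more tangible, here is a minimal, self-contained sketch. It is not the code in this PR: `FieldType`, `resolve_types`, and the plain `HashMap` standing in for the registry are all hypothetical names chosen for illustration, and the real `LogicalType` machinery is considerably richer.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a logical type. The "unresolved" state is an
/// explicit enum variant instead of being hidden behind dynamic dispatch.
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    /// A native type; no lookup is needed.
    Native(String),
    /// A canonical arrow extension type that is known to the system.
    Canonical(String),
    /// A custom extension type whose resolution is delayed.
    Unresolved { name: String },
    /// A custom extension type after the resolver pass has run.
    Resolved { name: String, description: String },
}

/// Hypothetical resolver pass: replaces every `Unresolved` entry with the
/// entry found in the registry (modeled here as a simple name -> description map).
fn resolve_types(
    fields: Vec<FieldType>,
    registry: &HashMap<String, String>,
) -> Result<Vec<FieldType>, String> {
    fields
        .into_iter()
        .map(|field| match field {
            FieldType::Unresolved { name } => {
                let found = registry.get(&name).cloned();
                match found {
                    Some(description) => Ok(FieldType::Resolved { name, description }),
                    None => Err(format!("unknown extension type: {name}")),
                }
            }
            // Native and canonical types pass through untouched.
            other => Ok(other),
        })
        .collect()
}
```

With an enum like this, a schema that still contains an `Unresolved` entry after the pass is easy to detect with a plain `match`, which is one argument for making the state explicit.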
I really like the suggestion by @paleolimbot to use pretty-printing record batches as the first use case. You can see a mini example in the test that pretty-prints UUIDs. The nice thing is that this probably would not require much plumbing, as the `DataFrame` already has access to the `SessionState`. The only thing that's missing for me to actually include this example here is that `arrow-rs` does not currently support passing custom pretty printers in `pretty_format_batches_with_options`.

Imagine that the `to_string` function in the `DataFrame` does the following:

1. Look up any extension type information from the schema (in a future world this would already be part of the schema and another lookup would not be necessary).
2. Gather the pretty printers.
3. Pass the pretty printers to arrow-rs for formatting.

If you think this is a worthwhile pursuit, we could add the capability to arrow-rs.

## Are these changes tested?

Not really, as there is no integrative example yet.

## Are there any user-facing changes?

There would be.
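As a rough illustration of the three-step `to_string` flow described above, here is a standalone sketch that uses only the standard library. Everything in it is an assumption made for illustration: the `ValuePrinter` trait, `UuidPrinter`, `printer_for`, and `format_column` are hypothetical names, not the `ValuePrettyPrinter` from this PR and not an arrow-rs API. The only things taken from Arrow are the `ARROW:extension:name` field metadata key and the canonical `arrow.uuid` extension name.

```rust
use std::collections::HashMap;

/// Hypothetical printer interface; the PR's `ValuePrettyPrinter` may look different.
trait ValuePrinter {
    fn print(&self, raw: &[u8]) -> Result<String, String>;
}

/// Formats 16 raw bytes as a hyphenated, lowercase UUID string (8-4-4-4-12).
struct UuidPrinter;

impl ValuePrinter for UuidPrinter {
    fn print(&self, raw: &[u8]) -> Result<String, String> {
        if raw.len() != 16 {
            return Err(format!("expected 16 bytes, got {}", raw.len()));
        }
        let mut out = String::with_capacity(36);
        for (i, byte) in raw.iter().enumerate() {
            // Hyphens go before bytes 4, 6, 8 and 10.
            if matches!(i, 4 | 6 | 8 | 10) {
                out.push('-');
            }
            out.push_str(&format!("{byte:02x}"));
        }
        Ok(out)
    }
}

/// Step 1: look up the extension type name in the field metadata.
fn extension_name(metadata: &HashMap<String, String>) -> Option<&str> {
    metadata.get("ARROW:extension:name").map(String::as_str)
}

/// Step 2: gather a printer for the extension type (hypothetical stand-in
/// for a lookup in an extension type registry).
fn printer_for(name: &str) -> Option<Box<dyn ValuePrinter>> {
    match name {
        "arrow.uuid" => Some(Box::new(UuidPrinter)),
        _ => None,
    }
}

/// Step 3: format the values, falling back to the debug representation of
/// the raw bytes when no printer is available.
fn format_column(
    metadata: &HashMap<String, String>,
    values: &[Vec<u8>],
) -> Result<Vec<String>, String> {
    let printer = extension_name(metadata).and_then(printer_for);
    values
        .iter()
        .map(|value| match &printer {
            Some(p) => p.print(value),
            None => Ok(format!("{value:?}")),
        })
        .collect()
}
```

In the real integration, step 3 would hand the gathered printers to arrow-rs instead of formatting the values directly, which is exactly the capability that is currently missing.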
