paleolimbot commented on PR #18552:
URL: https://github.com/apache/datafusion/pull/18552#issuecomment-3774267690

   Thank you Andrew for reviewing and thank you Tobias for the continued work 
and discussion!
   
   > I think this is the key observation / design point -- specifically, what 
is the shape of "information necessary"
   
   For us it is mostly just modifying a logical plan (e.g., replace SortExpr of 
`geometry` with `sd_order(geometry)`; replace `Cast(...)` with a scalar UDF 
call to `sd_cast()`). 
   
   > Thus, from where I sit the only way we are going to get LogicalTypes 
through DataFusion without a bunch more re-plumbing will be to use the Field 
metadata.
   
   I agree that a "serialized first" approach is the path of least resistance. 
It is not how any other query engine has implemented extension type and 
involves copying a lot of metadata, but it seems like there is a reasonable 
chance PRs that implement things in this way can get reviewed and merged 
because they don't require massive changes to the codebase or changes to 
arrow-rs.
   
   Pragmatically and with respect to this PR, I think this means that we should 
perhaps decouple the `LogicalType` from the extension registry. If we invented 
a new trait to encapsulate a "bound" extension type instead of adding to the 
LogicalType trait, we could have:
   
   ```rust
   pub trait BoundExtension {
       /// Returns an [`ArrayFormatter`] that can format values of this type.
       ///
       /// If `Ok(None)` is returned, the default implementation will be used.
       /// If an error is returned, there was an error creating the formatter.
       fn create_array_formatter<'fmt>(
           &self,
           _array: &'fmt dyn Array,
           _options: &FormatOptions<'fmt>,
       ) -> Result<Option<ArrayFormatter<'fmt>>> {
           Ok(None)
       }
   }
   ```
   
   Then the `ExtensionTypeRegistration`'s job (maybe this could also be called 
a `TypeExtension` or `UnboundExtension`?) would be to "bind" field metadata to 
an implementation (just replacing `LogcialType` with the `BoundExtension` 
here). It also might be that we don't need to bind a concrete data 
type/metadata for some things but I do think it's important to have somewhere 
(amortizes the cost of parsing JSON metadata).
   
   ```rust
   pub trait ExtensionTypeRegistration: Debug + Send + Sync {
       /// The name of the extension type.
       ///
       /// This name will be used to find the correct 
[ExtensionTypeRegistration] when an extension
       /// type is encountered.
       fn type_name(&self) -> &str;
   
       /// Creates a bound extension instance from the provided 
`extension_type_name` and an optional
       /// metadata.
       fn bind_extension(
           &self,
           storage_type: &DataType,
           extension_type_name: &str,
           metadata: Option<&str>,
       ) -> Result<BoundExtensionRef>;
   }
   ```
   
   This would unblock implementing extension functionality in places where we 
have access to the `session`, and we can still:
   
   - Pursue a DataFusion logical type (by adding a `get_bound_extension()` 
member to the logical type trait later)
   - Embedding a pointer in a Field (perhaps to the unbound extension type or 
registry to avoid inconsistency when a data type or metadata of a field is 
modified, or a bound extension that is cleared or errors when that information 
is modified)
   - Adding `DataType::Extension()` (this would save a reference to the 
BoundExtension in this example).
   - Or none of these things
   
   There are some things this won't solve that embedding a logical type might. 
Notably, implementing traits like Display, Hash, and PartialEq where there 
isn't and won't ever be a reference to the session. For those things we could 
use a `static` session to unblock compile-time support for things like variant 
and uuid. Or just not implement those things until we have a way to persist a 
BoundExtension.
   
   I'm not particularly attached to any of those words...our workarounds in 
SedonaDB are fine for now and this isn't blocking anything for me. My point is 
perhaps just that coupling the `LogicalType` to the `ExtensionRegistry` is 
holding up a lot of great ideas in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to