paleolimbot commented on PR #18552:
URL: https://github.com/apache/datafusion/pull/18552#issuecomment-3774267690
Thank you Andrew for reviewing and thank you Tobias for the continued work
and discussion!
> I think this is the key observation / design point -- specifically, what
is the shape of "information necessary"
For us it is mostly just modifying a logical plan (e.g., replace SortExpr of
`geometry` with `sd_order(geometry)`; replace `Cast(...)` with a scalar UDF
call to `sd_cast()`).
> Thus, from where I sit the only way we are going to get LogicalTypes
through DataFusion without a bunch more re-plumbing will be to use the Field
metadata.
I agree that a "serialized first" approach is the path of least resistance.
It is not how any other query engine has implemented extension type and
involves copying a lot of metadata, but it seems like there is a reasonable
chance PRs that implement things in this way can get reviewed and merged
because they don't require massive changes to the codebase or changes to
arrow-rs.
Pragmatically and with respect to this PR, I think this means that we should
perhaps decouple the `LogicalType` from the extension registry. If we invented
a new trait to encapsulate a "bound" extension type instead of adding to the
LogicalType trait, we could have:
```rust
pub trait BoundExtension {
/// Returns an [`ArrayFormatter`] that can format values of this type.
///
/// If `Ok(None)` is returned, the default implementation will be used.
/// If an error is returned, there was an error creating the formatter.
fn create_array_formatter<'fmt>(
&self,
_array: &'fmt dyn Array,
_options: &FormatOptions<'fmt>,
) -> Result<Option<ArrayFormatter<'fmt>>> {
Ok(None)
}
}
```
Then the `ExtensionTypeRegistration`'s job (maybe this could also be called
a `TypeExtension` or `UnboundExtension`?) would be to "bind" field metadata to
an implementation (just replacing `LogcialType` with the `BoundExtension`
here). It also might be that we don't need to bind a concrete data
type/metadata for some things but I do think it's important to have somewhere
(amortizes the cost of parsing JSON metadata).
```rust
pub trait ExtensionTypeRegistration: Debug + Send + Sync {
/// The name of the extension type.
///
/// This name will be used to find the correct
[ExtensionTypeRegistration] when an extension
/// type is encountered.
fn type_name(&self) -> &str;
/// Creates a bound extension instance from the provided
`extension_type_name` and an optional
/// metadata.
fn bind_extension(
&self,
storage_type: &DataType,
extension_type_name: &str,
metadata: Option<&str>,
) -> Result<BoundExtensionRef>;
}
```
This would unblock implementing extension functionality in places where we
have access to the `session`, and we can still:
- Pursue a DataFusion logical type (by adding a `get_bound_extension()`
member to the logical type trait later)
- Embedding a pointer in a Field (perhaps to the unbound extension type or
registry to avoid inconsistency when a data type or metadata of a field is
modified, or a bound extension that is cleared or errors when that information
is modified)
- Adding `DataType::Extension()` (this would save a reference to the
BoundExtension in this example).
- Or none of these things
There are some things this won't solve that embedding a logical type might.
Notably, implementing traits like Display, Hash, and PartialEq where there
isn't and won't ever be a reference to the session. For those things we could
use a `static` session to unblock compile-time support for things like variant
and uuid. Or just not implement those things until we have a way to persist a
BoundExtension.
I'm not particularly attached to any of those words...our workarounds in
SedonaDB are fine for now and this isn't blocking anything for me. My point is
perhaps just that coupling the `LogicalType` to the `ExtensionRegistry` is
holding up a lot of great ideas in this PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]