paleolimbot commented on code in PR #20312:
URL: https://github.com/apache/datafusion/pull/20312#discussion_r2989837166
##########
datafusion/expr/src/registry.rs:
##########
@@ -215,3 +218,280 @@ impl FunctionRegistry for MemoryFunctionRegistry {
self.udwfs.keys().cloned().collect()
}
}
+
+/// A cheaply cloneable pointer to an [ExtensionTypeRegistration].
+pub type ExtensionTypeRegistrationRef = Arc<dyn ExtensionTypeRegistration>;
+
+/// The registration of an extension type. Implementations of this trait are
responsible for
+/// *creating* instances of [`DFExtensionType`] that represent the entire
semantics of an extension
+/// type.
+///
+/// # Why do we need a Registration?
+///
+/// A good question is why this trait is even necessary. Why not directly
register the
+/// [`DFExtensionType`] in a registration?
+///
+/// While this works for extension types requiring no additional metadata
(e.g., `arrow.uuid`), it
+/// does not work for more complex extension types with metadata. For example,
consider an extension
+/// type `custom.shortened(n)` that aims to short the pretty-printing string
to `n` characters.
+/// Here, `n` is a parameter of the extension type and should be a field in
the struct that
+/// implements the [`DFExtensionType`]. The job of the registration is to read
the metadata from the
+/// field and create the corresponding [`DFExtensionType`] instance with the
correct `n` set.
+///
+/// The [`DefaultExtensionTypeRegistration`] provides a convenient way of
creating registrations.
+pub trait ExtensionTypeRegistration: Debug + Send + Sync {
+ /// The name of the extension type.
+ ///
+ /// This name will be used to find the correct [ExtensionTypeRegistration]
when an extension
+ /// type is encountered.
+ fn type_name(&self) -> &str;
+
+ /// Creates an extension type instance from the optional metadata. The
name of the extension
+ /// type is not a parameter as it's already defined by the registration
itself.
+ fn create_df_extension_type(
+ &self,
+ metadata: Option<&str>,
+ ) -> Result<DFExtensionTypeRef>;
+}
Review Comment:
My preference would be to pass `storage_type: &DataType, metadata:
Option<&str>` here (i.e., a DFExtensionType is a fully resolved "type" with all
of its parameters, and it has been validated before it is returned). This is
consistent with the "instance of arrow-rs `ExtensionType` == instance of
DFExtensionType analogy.
The alternative would be that the storage type is always provided to the
members of `DFExtensionType` (e.g., the `ArrayRef` of the formatter already
provides this in this PR so it works). If that's the angle we're shooting for,
it might make more sense to also omit `metadata` (but I'm not sure that would
make it more useful).
##########
datafusion/core/src/execution/session_state.rs:
##########
@@ -266,6 +271,10 @@ impl Session for SessionState {
&self.window_functions
}
+ fn extension_type_registry(&self) -> &ExtensionTypeRegistryRef {
+ &self.extension_types
+ }
Review Comment:
I don't think a change is necessarily required in this PR here, but in
https://github.com/apache/datafusion/pull/21071 it's necessary to use `fn
execution_props()` to add in the extension types such that optimizer rules and
creating a physical expression can see the extension type registry.
##########
datafusion/core/tests/extension_types/pretty_printing.rs:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::array::{FixedSizeBinaryArray, RecordBatch};
+use arrow_schema::extension::Uuid;
+use arrow_schema::{DataType, Field, Schema, SchemaRef};
+use datafusion::dataframe::DataFrame;
+use datafusion::error::Result;
+use datafusion::execution::SessionStateBuilder;
+use datafusion::prelude::SessionContext;
+use insta::assert_snapshot;
+use std::sync::Arc;
+
+fn test_schema() -> SchemaRef {
+ Arc::new(Schema::new(vec![uuid_field()]))
+}
+
+fn uuid_field() -> Field {
+ Field::new("my_uuids", DataType::FixedSizeBinary(16),
false).with_extension_type(Uuid)
+}
+
+async fn create_test_table() -> Result<DataFrame> {
+ let schema = test_schema();
+
+ // define data.
+ let batch = RecordBatch::try_new(
+ schema,
+ vec![Arc::new(FixedSizeBinaryArray::from(vec![
+ &[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
+ &[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 5, 6],
+ ]))],
+ )?;
+
+ let state = SessionStateBuilder::default()
+ .with_canonical_extension_types()?
+ .build();
+ let ctx = SessionContext::new_with_state(state);
+
+ ctx.register_batch("test", batch)?;
+
+ ctx.table("test").await
+}
+
+#[tokio::test]
+async fn test_pretty_print_logical_plan() -> Result<()> {
+ let result = create_test_table().await?.to_string().await?;
+
+ assert_snapshot!(
+ result,
+ @r"
+ +--------------------------------------+
+ | my_uuids |
+ +--------------------------------------+
+ | 00000000-0000-0000-0000-000000000000 |
+ | 00010203-0405-0607-0809-000102030506 |
+ +--------------------------------------+
+ "
+ );
Review Comment:
I have to revise this slightly because I tried `assert_bathches_eq()` and it
doesn't quite work because it goes through a different pathway.
`assert_snapshot()` works great though and is just as easy!
##########
datafusion/expr/src/registry.rs:
##########
@@ -215,3 +218,280 @@ impl FunctionRegistry for MemoryFunctionRegistry {
self.udwfs.keys().cloned().collect()
}
}
+
+/// A cheaply cloneable pointer to an [ExtensionTypeRegistration].
+pub type ExtensionTypeRegistrationRef = Arc<dyn ExtensionTypeRegistration>;
+
+/// The registration of an extension type. Implementations of this trait are
responsible for
+/// *creating* instances of [`DFExtensionType`] that represent the entire
semantics of an extension
+/// type.
+///
+/// # Why do we need a Registration?
+///
+/// A good question is why this trait is even necessary. Why not directly
register the
+/// [`DFExtensionType`] in a registration?
+///
+/// While this works for extension types requiring no additional metadata
(e.g., `arrow.uuid`), it
+/// does not work for more complex extension types with metadata. For example,
consider an extension
+/// type `custom.shortened(n)` that aims to short the pretty-printing string
to `n` characters.
+/// Here, `n` is a parameter of the extension type and should be a field in
the struct that
+/// implements the [`DFExtensionType`]. The job of the registration is to read
the metadata from the
+/// field and create the corresponding [`DFExtensionType`] instance with the
correct `n` set.
+///
+/// The [`DefaultExtensionTypeRegistration`] provides a convenient way of
creating registrations.
+pub trait ExtensionTypeRegistration: Debug + Send + Sync {
Review Comment:
Should these definitions live next to `DFExtensionType` in
datafusion_common? The logic to implement it is highly related to the logic to
implement the DFExtensionType and it probably makes sense to implement them
(e.g, for UUID) together.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]