tschwarzinger commented on code in PR #23169:
URL: https://github.com/apache/datafusion/pull/23169#discussion_r3497634425


##########
datafusion/physical-expr/src/expressions/cast.rs:
##########
@@ -63,6 +64,10 @@ pub struct CastExpr {
     target_field: FieldRef,
     /// Cast options
     cast_options: CastOptions<'static>,
+    /// Whether to preserve non-extension metadata from the source field.
+    /// When true (default), source metadata is merged with target metadata.
+    /// When false, only the target field's metadata is used.

Review Comment:
   If we keep this flag, should we use this flag too in `to_field` too? Sorry I 
am not too familiar with the casting system. And if the answer is yes, should 
we merge the metadata or just retain the source metadata like we currently do 
(I think)?



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
     -> Result<(DataType, bool)>;
 }
 
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target 
fields.
+///
+/// Metadata handling:
+/// - Source field metadata is propagated by default
+/// - Extension type metadata keys (`ARROW:extension:name` and 
`ARROW:extension:metadata`)
+///   are taken from the target field, overwriting any extension metadata from 
the source.
+///   This ensures casting does not incorrectly propagate extension type 
identity.
+///
 /// For `TryCast`, `force_nullable` is `true` since a failed cast returns NULL.
 fn cast_output_field(
     source_field: &FieldRef,
-    target_type: &DataType,
+    target_field: &FieldRef,
     force_nullable: bool,
 ) -> Arc<Field> {
+    use arrow_schema::extension::{EXTENSION_TYPE_METADATA_KEY, 
EXTENSION_TYPE_NAME_KEY};
+
+    // Start with source metadata
+    let mut metadata = source_field.metadata().clone();

Review Comment:
   Just to clarify for myself:
   - Creating a custom `Expr::Cast` with custom metadata (e.g., 
`my_custom_key`) has never really worked in the sense that `my_custom_key` 
becomes part of the output metadata.
   - And it also does not work in this branch as we only consider the extension 
name / metadata from the target field.
   
   I think this is fine and if we ever need something like this we can track it 
in a separate ticket with a use case that requires this sort of casts. I just 
to make sure that we are not removing the ability to include custom metadata 
for downstream users.



##########
datafusion/physical-expr/src/expressions/cast.rs:
##########
@@ -150,19 +182,50 @@ impl CastExpr {
         &self.cast_options
     }
 
+    /// Whether source metadata is preserved through the cast.
+    pub fn preserve_source_metadata(&self) -> bool {
+        self.preserve_source_metadata
+    }
+
     fn resolved_target_field(&self, input_schema: &Schema) -> Result<FieldRef> 
{
-        if is_default_target_field(&self.target_field) {
-            self.expr.return_field(input_schema).map(|field| {
+        // When using exact target field mode, return the target field directly
+        // without consulting the source expression (which may reference 
columns
+        // beyond the schema, e.g., virtual row-index columns appended at scan 
time).
+        if !self.preserve_source_metadata && 
!is_default_target_field(&self.target_field)
+        {
+            return Ok(Arc::clone(&self.target_field));
+        }
+
+        self.expr.return_field(input_schema).map(|source_field| {
+            if is_default_target_field(&self.target_field) {
+                // Type-only cast: derive from source field but strip 
extension metadata
+                let mut metadata = source_field.metadata().clone();
+                metadata.remove(EXTENSION_TYPE_NAME_KEY);
+                metadata.remove(EXTENSION_TYPE_METADATA_KEY);
+
                 Arc::new(
-                    field
+                    source_field
                         .as_ref()
                         .clone()
-                        .with_data_type(self.cast_type().clone()),
+                        .with_data_type(self.cast_type().clone())
+                        .with_metadata(metadata),
                 )
-            })
-        } else {
-            Ok(Arc::clone(&self.target_field))
-        }
+            } else {
+                // Explicit target field with source metadata preservation:
+                // - Start with source's non-extension metadata
+                // - Then add all target metadata (including extension type 
metadata)
+                let mut metadata = source_field.metadata().clone();
+                metadata.remove(EXTENSION_TYPE_NAME_KEY);
+                metadata.remove(EXTENSION_TYPE_METADATA_KEY);
+
+                // Target metadata takes precedence (including extension type 
metadata)
+                for (k, v) in self.target_field.metadata() {

Review Comment:
   Here we use the target metadata fields ("custom metadata") but not in 
`cast_output_field`, which confuses me a bit.



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -1208,4 +1229,121 @@ mod tests {
 
         assert_eq!(meta, expr.metadata(&schema).unwrap());
     }
+
+    #[test]
+    fn test_cast_extension_type_metadata() {
+        use crate::expr::Cast;
+        use arrow_schema::extension::{
+            EXTENSION_TYPE_METADATA_KEY, EXTENSION_TYPE_NAME_KEY,
+        };
+
+        // Create a schema with a field that has extension type metadata
+        let mut source_meta = HashMap::new();
+        source_meta.insert(
+            EXTENSION_TYPE_NAME_KEY.to_string(),
+            "arrow.uuid".to_string(),
+        );
+        source_meta.insert("custom_key".to_string(), 
"custom_value".to_string());
+
+        let source_field = Field::new("foo", DataType::FixedSizeBinary(16), 
false)
+            .with_metadata(source_meta);
+
+        let schema = MockExprSchema::new()
+            .with_data_type(DataType::FixedSizeBinary(16))
+            
.with_metadata(FieldMetadata::from(source_field.metadata().clone()));
+
+        // Test 1: Cast to a type without extension metadata strips extension 
metadata
+        // but preserves non-extension metadata
+        let cast_expr = Expr::Cast(Cast {
+            expr: Box::new(col("foo")),
+            field: Arc::new(Field::new("", DataType::Utf8, true)),
+        });
+
+        let (_, result_field) = cast_expr.to_field(&schema).unwrap();
+        assert!(
+            result_field
+                .metadata()
+                .get(EXTENSION_TYPE_NAME_KEY)
+                .is_none(),
+            "Extension type name should be stripped when target has no 
extension metadata"
+        );
+        assert_eq!(
+            result_field.metadata().get("custom_key"),
+            Some(&"custom_value".to_string()),
+            "Non-extension metadata should be preserved"
+        );
+
+        // Test 2: Cast to a field with different extension type replaces 
extension metadata
+        let mut target_meta = HashMap::new();
+        target_meta.insert(
+            EXTENSION_TYPE_NAME_KEY.to_string(),
+            "arrow.json".to_string(),
+        );
+        target_meta.insert(EXTENSION_TYPE_METADATA_KEY.to_string(), 
"{}".to_string());
+
+        let target_field =
+            Field::new("", DataType::Utf8, true).with_metadata(target_meta);
+
+        let cast_expr = Expr::Cast(Cast {
+            expr: Box::new(col("foo")),
+            field: Arc::new(target_field),
+        });
+
+        let (_, result_field) = cast_expr.to_field(&schema).unwrap();
+        assert_eq!(
+            result_field.metadata().get(EXTENSION_TYPE_NAME_KEY),
+            Some(&"arrow.json".to_string()),
+            "Extension type name should come from target field"
+        );
+        assert_eq!(
+            result_field.metadata().get(EXTENSION_TYPE_METADATA_KEY),
+            Some(&"{}".to_string()),
+            "Extension type metadata should come from target field"
+        );
+        assert_eq!(
+            result_field.metadata().get("custom_key"),
+            Some(&"custom_value".to_string()),
+            "Non-extension metadata should still be preserved from source"
+        );
+    }
+
+    #[test]
+    fn test_try_cast_extension_type_metadata() {

Review Comment:
   Maybe we can combine these two tests and just parameterize them with 
`Expr::TryCast` and `Expr::Cast` ?
   
   Currently, the test for `Expr::Cast` seems to check that the metadata of the 
target field is used, while the test for `Expr::TryCast` does not test that. We 
can maybe avoid the nullability check in `Expr::TryCast` as this has nothing to 
do with extension types and should be done in a separate tests.
   
   However, this could make the test more "complicated" and keeping them apart 
might be better. Up to you how you want to deal with it.



##########
datafusion/physical-expr/src/expressions/cast.rs:
##########
@@ -1208,6 +1272,85 @@ mod tests {
 
         Ok(())
     }
+
+    #[test]
+    fn type_only_cast_strips_extension_metadata() -> Result<()> {
+        // When using type-only cast (new()), extension metadata from source 
should NOT propagate
+        let source_meta = HashMap::from([
+            (
+                EXTENSION_TYPE_NAME_KEY.to_string(),
+                "arrow.uuid".to_string(),
+            ),
+            ("custom_key".to_string(), "custom_value".to_string()),
+        ]);
+        let schema = Schema::new(vec![
+            Field::new("a", FixedSizeBinary(16), 
false).with_metadata(source_meta),
+        ]);
+
+        let expr = CastExpr::new(col("a", &schema)?, Utf8, None);
+
+        let field = expr.return_field(&schema)?;
+        assert!(
+            field.metadata().get(EXTENSION_TYPE_NAME_KEY).is_none(),
+            "Type-only cast should strip extension type name from source"
+        );
+        assert_eq!(
+            field.metadata().get("custom_key"),
+            Some(&"custom_value".to_string()),
+            "Type-only cast should preserve non-extension metadata"
+        );
+
+        Ok(())
+    }
+
+    #[test]
+    fn field_aware_cast_applies_target_extension_metadata() -> Result<()> {
+        // When using field-aware cast, target's extension metadata should be 
applied
+        let source_meta = HashMap::from([
+            (
+                EXTENSION_TYPE_NAME_KEY.to_string(),
+                "source.type".to_string(),
+            ),
+            ("source_key".to_string(), "source_value".to_string()),

Review Comment:
   Maybe adding "target_key" to the `target_meta` would be beneficial too.



##########
datafusion/physical-expr/src/expressions/cast.rs:
##########
@@ -63,6 +64,10 @@ pub struct CastExpr {
     target_field: FieldRef,
     /// Cast options
     cast_options: CastOptions<'static>,
+    /// Whether to preserve non-extension metadata from the source field.
+    /// When true (default), source metadata is merged with target metadata.
+    /// When false, only the target field's metadata is used.
+    preserve_source_metadata: bool,

Review Comment:
   Hmm maybe `rewrite_file_row_index_expr` should not use the `CastExpr` for 
this purpose but an "intrinsic" UDF? But I guess this is a different issue 
then. Not sure how to otherwise avoid that (especially, because we would then 
require this flag for `TryCast` if we want feature parity).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to