alamb commented on code in PR #23169:
URL: https://github.com/apache/datafusion/pull/23169#discussion_r3507474686


##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
     -> Result<(DataType, bool)>;
 }
 
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target 
fields.
+///
+/// Metadata handling:
+/// - Source field metadata is propagated by default
+/// - Extension type metadata keys (`ARROW:extension:name` and 
`ARROW:extension:metadata`)
+///   are taken from the target field, overwriting any extension metadata from 
the source.
+///   This ensures casting does not incorrectly propagate extension type 
identity.
+///
 /// For `TryCast`, `force_nullable` is `true` since a failed cast returns NULL.
 fn cast_output_field(
     source_field: &FieldRef,
-    target_type: &DataType,
+    target_field: &FieldRef,
     force_nullable: bool,
 ) -> Arc<Field> {
+    use arrow_schema::extension::{EXTENSION_TYPE_METADATA_KEY, 
EXTENSION_TYPE_NAME_KEY};
+
+    // Start with source metadata
+    let mut metadata = source_field.metadata().clone();
+
+    // Remove any extension type metadata from source - these should not 
propagate through casts
+    metadata.remove(EXTENSION_TYPE_NAME_KEY);
+    metadata.remove(EXTENSION_TYPE_METADATA_KEY);
+
+    // Add extension type metadata from the target field if present
+    let target_metadata = target_field.metadata();
+    if let Some(name) = target_metadata.get(EXTENSION_TYPE_NAME_KEY) {
+        metadata.insert(EXTENSION_TYPE_NAME_KEY.to_string(), name.clone());
+    }
+    if let Some(ext_meta) = target_metadata.get(EXTENSION_TYPE_METADATA_KEY) {
+        metadata.insert(EXTENSION_TYPE_METADATA_KEY.to_string(), 
ext_meta.clone());
+    }
+

Review Comment:
   It feels strange to me that when casting to a new Field that the metadata is 
taken from the *source*
   
   I think I would personally expect it to come from the target Field (as we 
are casting "to" that field) -- though I see that is not what the current code 
does
   
   



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
     -> Result<(DataType, bool)>;
 }
 
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target 
fields.
+///
+/// Metadata handling:

Review Comment:
   Since this is an internal function, I don't think this documentation will be 
easy to find  -- that is not to say it isn't good to have. However I think we 
should probably also make a more authoritative docs
   
   For example, it would be nice to link to this detail from here
   
   
https://github.com/apache/datafusion/blob/9e8dd76d6deb6736c51962d9c97e04be4e3f1fc9/datafusion/expr/src/expr_schema.rs#L458



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
     -> Result<(DataType, bool)>;
 }
 
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target 
fields.
+///
+/// Metadata handling:
+/// - Source field metadata is propagated by default
+/// - Extension type metadata keys (`ARROW:extension:name` and 
`ARROW:extension:metadata`)
+///   are taken from the target field, overwriting any extension metadata from 
the source.
+///   This ensures casting does not incorrectly propagate extension type 
identity.
+///
 /// For `TryCast`, `force_nullable` is `true` since a failed cast returns NULL.
 fn cast_output_field(
     source_field: &FieldRef,
-    target_type: &DataType,
+    target_field: &FieldRef,
     force_nullable: bool,
 ) -> Arc<Field> {
+    use arrow_schema::extension::{EXTENSION_TYPE_METADATA_KEY, 
EXTENSION_TYPE_NAME_KEY};
+
+    // Start with source metadata
+    let mut metadata = source_field.metadata().clone();
+
+    // Remove any extension type metadata from source - these should not 
propagate through casts

Review Comment:
   I worry this is not well documented and we may not be consistent across 
operators
   
   What do you think about adding documentation on what the expectation for 
metadata / custom types? I feel like if we had a some good refereence we could 
then evaluate if the code matched the desired behavior. At the moment I don't 
know what the overall behavior is supposed to be



##########
datafusion/physical-expr/src/expressions/cast.rs:
##########
@@ -127,6 +136,29 @@ impl CastExpr {
             expr,
             target_field,
             cast_options: cast_options.unwrap_or(DEFAULT_CAST_OPTIONS),
+            preserve_source_metadata: true,
+        }
+    }
+
+    /// Create a new `CastExpr` with an explicit target `FieldRef`, using the

Review Comment:
   can we instead make this a more builder style API -- like 
   ```rust
   pub fn with_preserve_source_metadata(self, preserve_source_metadata: bool) 
-> Self {
     self.preserve_source_metadata = preserve_source_metadata;
     self
   }
   ```
   
   I think that would make th callsites easier to read
   ```rust
   let cast= Cast::new(..)
     .with_preserve_source_metadata(false)
   ```



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
     -> Result<(DataType, bool)>;
 }
 
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target 
fields.
+///
+/// Metadata handling:
+/// - Source field metadata is propagated by default
+/// - Extension type metadata keys (`ARROW:extension:name` and 
`ARROW:extension:metadata`)
+///   are taken from the target field, overwriting any extension metadata from 
the source.
+///   This ensures casting does not incorrectly propagate extension type 
identity.
+///
 /// For `TryCast`, `force_nullable` is `true` since a failed cast returns NULL.
 fn cast_output_field(
     source_field: &FieldRef,
-    target_type: &DataType,
+    target_field: &FieldRef,
     force_nullable: bool,
 ) -> Arc<Field> {
+    use arrow_schema::extension::{EXTENSION_TYPE_METADATA_KEY, 
EXTENSION_TYPE_NAME_KEY};
+
+    // Start with source metadata
+    let mut metadata = source_field.metadata().clone();

Review Comment:
   > with custom metadata (e.g., my_custom_key) has never really worked in the 
sense that my_custom_key becomes part of the output metadata.
   
   Do you mean like adding metadata to the `Cast` expr itself?
   
   ```rust
   #[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)]
   pub struct Cast {
       /// The expression being cast
       pub expr: Box<Expr>,
       /// The `DataType` the expression will yield
       pub field: FieldRef,
       /// *** New Field: custom metadata **
       pub metadata: Metadata,
   }
   
   ```
   
   We can probably achieve the same thing with a user defined function perhaps



##########
datafusion/physical-expr-adapter/src/rewrite.rs:
##########
@@ -102,7 +102,7 @@ pub fn rewrite_file_row_index_expr(
     rewrite_scalar_udf::<FileRowIndexFunc, _>(expr, |_| {
         let source = Arc::new(Column::new(row_index_name, row_index_idx));
         let target_field = Arc::new(Field::new("file_row_index", 
DataType::Int64, true));
-        Ok(Arc::new(CastExpr::new_with_target_field(
+        Ok(Arc::new(CastExpr::new_with_exact_target_field(

Review Comment:
   Can you please add a comment here about why it is preservng the source field?
   
   Or is there some way in this code to calculate the desired output metadata 
here (and attach it to the target field) ?
   
   



##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
     -> Result<(DataType, bool)>;
 }
 
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target 
fields.
+///
+/// Metadata handling:
+/// - Source field metadata is propagated by default
+/// - Extension type metadata keys (`ARROW:extension:name` and 
`ARROW:extension:metadata`)
+///   are taken from the target field, overwriting any extension metadata from 
the source.
+///   This ensures casting does not incorrectly propagate extension type 
identity.
+///
 /// For `TryCast`, `force_nullable` is `true` since a failed cast returns NULL.
 fn cast_output_field(
     source_field: &FieldRef,
-    target_type: &DataType,
+    target_field: &FieldRef,
     force_nullable: bool,
 ) -> Arc<Field> {
+    use arrow_schema::extension::{EXTENSION_TYPE_METADATA_KEY, 
EXTENSION_TYPE_NAME_KEY};
+
+    // Start with source metadata
+    let mut metadata = source_field.metadata().clone();
+
+    // Remove any extension type metadata from source - these should not 
propagate through casts

Review Comment:
   We could do this as a separate PR



##########
datafusion/physical-expr/src/expressions/cast.rs:
##########
@@ -63,6 +64,10 @@ pub struct CastExpr {
     target_field: FieldRef,
     /// Cast options
     cast_options: CastOptions<'static>,
+    /// Whether to preserve non-extension metadata from the source field.
+    /// When true (default), source metadata is merged with target metadata.
+    /// When false, only the target field's metadata is used.
+    preserve_source_metadata: bool,

Review Comment:
   I had an alternate suggestion (compute the target metadata needed in the 
virtual row number code)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to