alamb commented on code in PR #23169:
URL: https://github.com/apache/datafusion/pull/23169#discussion_r3507474686
##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
-> Result<(DataType, bool)>;
}
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target
fields.
+///
+/// Metadata handling:
+/// - Source field metadata is propagated by default
+/// - Extension type metadata keys (`ARROW:extension:name` and
`ARROW:extension:metadata`)
+/// are taken from the target field, overwriting any extension metadata from
the source.
+/// This ensures casting does not incorrectly propagate extension type
identity.
+///
/// For `TryCast`, `force_nullable` is `true` since a failed cast returns NULL.
fn cast_output_field(
source_field: &FieldRef,
- target_type: &DataType,
+ target_field: &FieldRef,
force_nullable: bool,
) -> Arc<Field> {
+ use arrow_schema::extension::{EXTENSION_TYPE_METADATA_KEY,
EXTENSION_TYPE_NAME_KEY};
+
+ // Start with source metadata
+ let mut metadata = source_field.metadata().clone();
+
+ // Remove any extension type metadata from source - these should not
propagate through casts
+ metadata.remove(EXTENSION_TYPE_NAME_KEY);
+ metadata.remove(EXTENSION_TYPE_METADATA_KEY);
+
+ // Add extension type metadata from the target field if present
+ let target_metadata = target_field.metadata();
+ if let Some(name) = target_metadata.get(EXTENSION_TYPE_NAME_KEY) {
+ metadata.insert(EXTENSION_TYPE_NAME_KEY.to_string(), name.clone());
+ }
+ if let Some(ext_meta) = target_metadata.get(EXTENSION_TYPE_METADATA_KEY) {
+ metadata.insert(EXTENSION_TYPE_METADATA_KEY.to_string(),
ext_meta.clone());
+ }
+
Review Comment:
It feels strange to me that when casting to a new Field that the metadata is
taken from the *source*
I think I would personally expect it to come from the target Field (as we
are casting "to" that field) -- though I see that is not what the current code
does
##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
-> Result<(DataType, bool)>;
}
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target
fields.
+///
+/// Metadata handling:
Review Comment:
Since this is an internal function, I don't think this documentation will be
easy to find -- that is not to say it isn't good to have. However I think we
should probably also make a more authoritative docs
For example, it would be nice to link to this detail from here
https://github.com/apache/datafusion/blob/9e8dd76d6deb6736c51962d9c97e04be4e3f1fc9/datafusion/expr/src/expr_schema.rs#L458
##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
-> Result<(DataType, bool)>;
}
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target
fields.
+///
+/// Metadata handling:
+/// - Source field metadata is propagated by default
+/// - Extension type metadata keys (`ARROW:extension:name` and
`ARROW:extension:metadata`)
+/// are taken from the target field, overwriting any extension metadata from
the source.
+/// This ensures casting does not incorrectly propagate extension type
identity.
+///
/// For `TryCast`, `force_nullable` is `true` since a failed cast returns NULL.
fn cast_output_field(
source_field: &FieldRef,
- target_type: &DataType,
+ target_field: &FieldRef,
force_nullable: bool,
) -> Arc<Field> {
+ use arrow_schema::extension::{EXTENSION_TYPE_METADATA_KEY,
EXTENSION_TYPE_NAME_KEY};
+
+ // Start with source metadata
+ let mut metadata = source_field.metadata().clone();
+
+ // Remove any extension type metadata from source - these should not
propagate through casts
Review Comment:
I worry this is not well documented and we may not be consistent across
operators
What do you think about adding documentation on what the expectation for
metadata / custom types? I feel like if we had a some good refereence we could
then evaluate if the code matched the desired behavior. At the moment I don't
know what the overall behavior is supposed to be
##########
datafusion/physical-expr/src/expressions/cast.rs:
##########
@@ -127,6 +136,29 @@ impl CastExpr {
expr,
target_field,
cast_options: cast_options.unwrap_or(DEFAULT_CAST_OPTIONS),
+ preserve_source_metadata: true,
+ }
+ }
+
+ /// Create a new `CastExpr` with an explicit target `FieldRef`, using the
Review Comment:
can we instead make this a more builder style API -- like
```rust
pub fn with_preserve_source_metadata(self, preserve_source_metadata: bool)
-> Self {
self.preserve_source_metadata = preserve_source_metadata;
self
}
```
I think that would make th callsites easier to read
```rust
let cast= Cast::new(..)
.with_preserve_source_metadata(false)
```
##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
-> Result<(DataType, bool)>;
}
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target
fields.
+///
+/// Metadata handling:
+/// - Source field metadata is propagated by default
+/// - Extension type metadata keys (`ARROW:extension:name` and
`ARROW:extension:metadata`)
+/// are taken from the target field, overwriting any extension metadata from
the source.
+/// This ensures casting does not incorrectly propagate extension type
identity.
+///
/// For `TryCast`, `force_nullable` is `true` since a failed cast returns NULL.
fn cast_output_field(
source_field: &FieldRef,
- target_type: &DataType,
+ target_field: &FieldRef,
force_nullable: bool,
) -> Arc<Field> {
+ use arrow_schema::extension::{EXTENSION_TYPE_METADATA_KEY,
EXTENSION_TYPE_NAME_KEY};
+
+ // Start with source metadata
+ let mut metadata = source_field.metadata().clone();
Review Comment:
> with custom metadata (e.g., my_custom_key) has never really worked in the
sense that my_custom_key becomes part of the output metadata.
Do you mean like adding metadata to the `Cast` expr itself?
```rust
#[derive(Clone, PartialEq, Eq, PartialOrd, Hash, Debug)]
pub struct Cast {
/// The expression being cast
pub expr: Box<Expr>,
/// The `DataType` the expression will yield
pub field: FieldRef,
/// *** New Field: custom metadata **
pub metadata: Metadata,
}
```
We can probably achieve the same thing with a user defined function perhaps
##########
datafusion/physical-expr-adapter/src/rewrite.rs:
##########
@@ -102,7 +102,7 @@ pub fn rewrite_file_row_index_expr(
rewrite_scalar_udf::<FileRowIndexFunc, _>(expr, |_| {
let source = Arc::new(Column::new(row_index_name, row_index_idx));
let target_field = Arc::new(Field::new("file_row_index",
DataType::Int64, true));
- Ok(Arc::new(CastExpr::new_with_target_field(
+ Ok(Arc::new(CastExpr::new_with_exact_target_field(
Review Comment:
Can you please add a comment here about why it is preservng the source field?
Or is there some way in this code to calculate the desired output metadata
here (and attach it to the target field) ?
##########
datafusion/expr/src/expr_schema.rs:
##########
@@ -69,18 +69,43 @@ pub trait ExprSchemable {
-> Result<(DataType, bool)>;
}
-/// Derives the output field for a cast expression from the source field.
+/// Derives the output field for a cast expression from the source and target
fields.
+///
+/// Metadata handling:
+/// - Source field metadata is propagated by default
+/// - Extension type metadata keys (`ARROW:extension:name` and
`ARROW:extension:metadata`)
+/// are taken from the target field, overwriting any extension metadata from
the source.
+/// This ensures casting does not incorrectly propagate extension type
identity.
+///
/// For `TryCast`, `force_nullable` is `true` since a failed cast returns NULL.
fn cast_output_field(
source_field: &FieldRef,
- target_type: &DataType,
+ target_field: &FieldRef,
force_nullable: bool,
) -> Arc<Field> {
+ use arrow_schema::extension::{EXTENSION_TYPE_METADATA_KEY,
EXTENSION_TYPE_NAME_KEY};
+
+ // Start with source metadata
+ let mut metadata = source_field.metadata().clone();
+
+ // Remove any extension type metadata from source - these should not
propagate through casts
Review Comment:
We could do this as a separate PR
##########
datafusion/physical-expr/src/expressions/cast.rs:
##########
@@ -63,6 +64,10 @@ pub struct CastExpr {
target_field: FieldRef,
/// Cast options
cast_options: CastOptions<'static>,
+ /// Whether to preserve non-extension metadata from the source field.
+ /// When true (default), source metadata is merged with target metadata.
+ /// When false, only the target field's metadata is used.
+ preserve_source_metadata: bool,
Review Comment:
I had an alternate suggestion (compute the target metadata needed in the
virtual row number code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]