paleolimbot opened a new issue, #20748: URL: https://github.com/apache/datafusion/issues/20748
### Is your feature request related to a problem or challenge?

(In a recent community sync I offered to review cast usage with an eye towards supporting customizable rules a la Spark and/or extension type supported casting)

After #18136 we can represent casts to an extension type in the logical plan and Substrait, and after #20676 we will be able to represent casts to an extension type in SQL. While these can now be intercepted by an optimizer rule or logical plan modification, they will currently error if passed to the default planner. In both of these contexts we have `ConfigOptions` and/or a session, so custom cast rules (e.g., https://github.com/apache/datafusion/issues/11201 ) or extension type-aware casts (e.g., discussed in https://github.com/apache/datafusion/pull/20312 ) could be looked up or applied.

While there are many places that create logical or physical casts whose execution is handled during the normal sequence of expression evaluation, there are a few high traffic places that depend on casting outside execution:

- `Signature`s that "coerce" (i.e. implicitly cast with special rules) values before they are passed to UDFs
- `ColumnarValue::cast_to()` and `ScalarValue::cast_to()`, used within many scalar functions (some of them could probably use the signature and possibly predate some recent improvements)
- Raw calls to `arrow::compute::cast()`

Casting implementations are mostly funnelled through `arrow::compute::cast()`; however, there are a number of custom tweaks and/or certain types of casts that affect eligibility for certain optimizations:

- `try_cast_literal_to_type()`: https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/casts.rs
- The physical cast expression has a list of types: https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/physical-expr/src/expressions/cast.rs#L118-L144
- `ScalarValue::cast_to()` has accumulated a few modifications on top of the arrow cast kernel: https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/common/src/scalar/mod.rs#L3888-L3920
- `ColumnarValue::cast_to()` has the same modifications except with an independent implementation (?) of timestamp bounds ( https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/columnar_value.rs#L277-L304 )
- `cast_struct_column()`, which is used for the last two plus the `PhysicalExprAdapter` for name-based struct casting ( https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/common/src/nested_struct.rs#L26-L53 )
- Intervals probably should have a custom cast to ensure that narrowing casts preserve containment although I didn't see that here: https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/interval_arithmetic.rs#L423-L433
- "Type coercion", which is sort of like an "implicit" cast (or a cast with a cost of 0 in DuckDB terms): https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr/src/type_coercion/functions.rs / https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/signature.rs#L1037-L1052
- "default cast for" in the LogicalType: https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/common/src/types/native.rs#L275-L281

Many of these have a short circuit along the lines of:

```rust
// If types are already equal, no cast needed
if array.data_type() == cast_type {
    return Ok(Arc::clone(array));
}
```

...which is problematic in the context of extension types as we don't have anything that can actually check whether two extension types are "equal" (equality of field refs is too strict; storage type equality is too lax) until we have an extension type registry.

### Describe the solution you'd like

I would personally love some consolidation around casting so that a wider range of internal casts can consider extension metadata.
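
To make the "too strict / too lax" point about the short circuit concrete, here is a minimal, dependency-free sketch. `StorageType`, `Field`, and the three comparison helpers are hypothetical stand-ins written for illustration (they are not the arrow-rs or DataFusion types); only the `ARROW:extension:name` and `ARROW:extension:metadata` keys come from the Arrow format. The `extension_equal()` helper is just one possible notion of equality and compares serialized metadata verbatim, which is itself an assumption: a real check may need type-specific logic, which is where a registry would come in.

```rust
use std::collections::BTreeMap;

// Metadata keys defined by the Arrow format for extension types.
const EXT_NAME_KEY: &str = "ARROW:extension:name";
const EXT_META_KEY: &str = "ARROW:extension:metadata";

/// Hypothetical stand-in for a storage data type (not arrow-rs `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum StorageType {
    Utf8,
    Int64,
}

/// Hypothetical stand-in for a field: name, storage type, string metadata.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    storage: StorageType,
    metadata: BTreeMap<String, String>,
}

/// Convenience constructor for the examples below.
fn field(name: &str, storage: StorageType, metadata: &[(&str, &str)]) -> Field {
    Field {
        name: name.to_string(),
        storage,
        metadata: metadata
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}

/// What the short circuit effectively checks today: storage types only.
fn storage_equal(a: &Field, b: &Field) -> bool {
    a.storage == b.storage
}

/// Whole-field equality: also compares names and unrelated metadata.
fn field_equal(a: &Field, b: &Field) -> bool {
    a == b
}

/// One possible middle ground (an assumption, not an existing API):
/// same storage type, same extension name, same serialized extension metadata.
fn extension_equal(a: &Field, b: &Field) -> bool {
    a.storage == b.storage
        && a.metadata.get(EXT_NAME_KEY) == b.metadata.get(EXT_NAME_KEY)
        && a.metadata.get(EXT_META_KEY) == b.metadata.get(EXT_META_KEY)
}

fn main() {
    let plain = field("a", StorageType::Utf8, &[]);
    let json = field("a", StorageType::Utf8, &[(EXT_NAME_KEY, "arrow.json")]);
    let json_other = field(
        "b",
        StorageType::Utf8,
        &[(EXT_NAME_KEY, "arrow.json"), ("comment", "unrelated")],
    );
    let int = field("a", StorageType::Int64, &[]);

    // Too lax: a plain string and a JSON extension look identical,
    // so a storage-only short circuit would skip the cast.
    assert!(storage_equal(&plain, &json));
    assert!(!extension_equal(&plain, &json));

    // Too strict: same extension type, but the field name and an
    // unrelated metadata key make whole-field equality fail.
    assert!(!field_equal(&json, &json_other));
    assert!(extension_equal(&json, &json_other));

    // Different storage types are unequal under every notion.
    assert!(!storage_equal(&plain, &int));
}
```
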
A motivating use case might be to support the new datetime extension type as many of the departures from `arrow::compute::cast()` are related to datetime handling.

A good first step might be to create a drop-in replacement to `arrow::compute::cast()` in `datafusion-common` whose main purpose would be to track usage (and perhaps consolidate some of the specific datetime-related departures from `arrow::compute::cast()`). A good next step might be to create a drop-in replacement that operates on `Field`s or `FieldRef`s (we kind of have this in datafusion-common already to handle special rules around casting structs).

After these are consolidated and we have a way to look up registered casts for extension types, we can perhaps add a variant that pipes through a `ConfigOptions` or session reference (which would let those internal casts behave identically to cast expressions).

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
