paleolimbot opened a new issue, #20748:
URL: https://github.com/apache/datafusion/issues/20748

   ### Is your feature request related to a problem or challenge?
   
   (In a recent community sync I offered to review cast usage with an eye 
towards supporting customizable cast rules à la Spark and/or 
extension-type-aware casting.)
   
   After #18136 we can represent casts to an extension type in the logical plan 
and Substrait, and after #20676 we will be able to represent casts to an 
extension type in SQL. While these can now be intercepted by an optimizer rule 
or logical plan modification, they will currently error if passed to the 
default planner. In both of these contexts we have `ConfigOptions` and/or a 
session, so custom cast rules (e.g., 
https://github.com/apache/datafusion/issues/11201 ) or extension type-aware 
casts (e.g., discussed in https://github.com/apache/datafusion/pull/20312 ) 
could be looked up or applied.
   
   While many places create logical or physical casts whose execution is 
handled during the normal sequence of expression evaluation, a few 
high-traffic places depend on casting outside execution:
   
   - `Signature`s that "coerce" (i.e. implicitly cast with special rules) 
values before they are passed to UDFs
   - `ColumnarValue::cast_to()` and `ScalarValue::cast_to()`, used within many 
scalar functions (some of these could probably rely on the signature instead 
and may predate recent improvements there)
   - Raw calls to `arrow::compute::cast()`
   
   Casting implementations are mostly funnelled through 
`arrow::compute::cast()`; however, there are a number of custom tweaks and/or 
certain types of casts that affect eligibility for certain optimizations:
   
   - `try_cast_literal_to_type()`: 
https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/casts.rs
   - The physical cast expression has a list of types: 
https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/physical-expr/src/expressions/cast.rs#L118-L144
   - `ScalarValue::cast_to()` has accumulated a few modifications on top of the 
arrow cast kernel: 
https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/common/src/scalar/mod.rs#L3888-L3920
   - `ColumnarValue::cast_to()` has the same modifications except with an 
independent implementation (?) of timestamp bounds ( 
https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/columnar_value.rs#L277-L304
 )
   - `cast_struct_column()`, which is used for the last two plus the 
`PhysicalExprAdapter` for name-based struct casting ( 
https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/common/src/nested_struct.rs#L26-L53
 )
   - Intervals should probably have a custom cast to ensure that narrowing 
casts preserve containment, although I didn't see one here: 
https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/interval_arithmetic.rs#L423-L433
   - "Type coercion", which is sort of like an "implicit" cast (or a cast with 
a cost of 0 in DuckDB terms): 
https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr/src/type_coercion/functions.rs
 / 
https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/expr-common/src/signature.rs#L1037-L1052
   - `default_cast_for()` in the `LogicalType`: 
https://github.com/apache/datafusion/blob/678d1ad7f4590e74e7bae0326292949617da0f57/datafusion/common/src/types/native.rs#L275-L281
   
   Many of these have a short circuit along the lines of:
   
   ```rust
   // If types are already equal, no cast needed
   if array.data_type() == cast_type {
       return Ok(Arc::clone(array));
   }
   ```
   
   ...which is problematic for extension types: until we have an extension 
type registry, nothing can actually check whether two extension types are 
"equal" (equality of field refs is too strict; storage type equality is too 
lax).
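   
   To make the "too strict" / "too lax" distinction concrete, here is a minimal, std-only sketch. `ToyField` and the three comparison functions are hypothetical stand-ins, not DataFusion or Arrow APIs; the only real detail is the `ARROW:extension:name` metadata key. The "extension-aware" check compares only the extension name, which is an assumption: a real registry might also need to interpret the extension metadata.
   
   ```rust
   use std::collections::BTreeMap;
   
   /// Metadata key Arrow uses to name an extension type on a field.
   const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";
   
   /// Hypothetical stand-in for an Arrow field: storage type plus metadata.
   #[derive(Debug, Clone, PartialEq)]
   struct ToyField {
       name: String,
       storage_type: String,
       metadata: BTreeMap<String, String>,
   }
   
   impl ToyField {
       fn new(name: &str, storage_type: &str, metadata: &[(&str, &str)]) -> Self {
           Self {
               name: name.to_string(),
               storage_type: storage_type.to_string(),
               metadata: metadata
                   .iter()
                   .map(|(k, v)| (k.to_string(), v.to_string()))
                   .collect(),
           }
       }
   
       fn extension_name(&self) -> Option<&str> {
           self.metadata.get(EXTENSION_NAME_KEY).map(String::as_str)
       }
   }
   
   /// Too strict: unrelated metadata (or a different name) breaks equality.
   fn equal_strict(a: &ToyField, b: &ToyField) -> bool {
       a == b
   }
   
   /// Too lax: ignores the extension type entirely.
   fn equal_storage(a: &ToyField, b: &ToyField) -> bool {
       a.storage_type == b.storage_type
   }
   
   /// Assumed middle ground: same storage type and same extension name.
   fn equal_extension_aware(a: &ToyField, b: &ToyField) -> bool {
       a.storage_type == b.storage_type && a.extension_name() == b.extension_name()
   }
   ```
   
   With these definitions, a UUID field and the same field carrying an extra `comment` metadata entry differ under `equal_strict()` but match under `equal_extension_aware()`, while a UUID field and a plain `FixedSizeBinary(16)` field match under `equal_storage()` even though a short circuit there would silently drop the extension type.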
   
   ### Describe the solution you'd like
   
   I would personally love some consolidation around casting so that a wider 
range of internal casts can consider extension metadata. A motivating use case 
might be to support the new datetime extension type as many of the departures 
from `arrow::compute::cast()` are related to datetime handling.
   
   A good first step might be to create a drop-in replacement for 
`arrow::compute::cast()` in `datafusion-common` whose main purpose would be to 
track usage (and perhaps consolidate some of the specific datetime-related 
departures from `arrow::compute::cast()`).
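   
   As a rough illustration of the "drop-in replacement" idea, the std-only sketch below wraps a toy kernel behind a function with the same signature. Every name here (`ToyValue`, `ToyType`, `kernel_cast()`, `tracked_cast()`, `cast_call_count()`) is hypothetical, and `kernel_cast()` merely stands in for `arrow::compute::cast()`. The runtime counter is one possible reading of "track usage"; a single wrapper also gives one searchable entry point in the code base.
   
   ```rust
   use std::sync::atomic::{AtomicUsize, Ordering};
   
   /// Counts how many casts were funnelled through the wrapper.
   static CAST_CALLS: AtomicUsize = AtomicUsize::new(0);
   
   /// Hypothetical stand-ins for Arrow arrays and `DataType`.
   #[derive(Debug, Clone, PartialEq)]
   enum ToyValue {
       Int64(i64),
       Utf8(String),
   }
   
   #[derive(Debug, Clone, Copy, PartialEq)]
   enum ToyType {
       Int64,
       Utf8,
   }
   
   /// Stand-in for the underlying cast kernel.
   fn kernel_cast(value: &ToyValue, to: ToyType) -> Result<ToyValue, String> {
       match (value, to) {
           (ToyValue::Int64(v), ToyType::Int64) => Ok(ToyValue::Int64(*v)),
           (ToyValue::Int64(v), ToyType::Utf8) => Ok(ToyValue::Utf8(v.to_string())),
           (ToyValue::Utf8(s), ToyType::Utf8) => Ok(ToyValue::Utf8(s.clone())),
           (ToyValue::Utf8(s), ToyType::Int64) => s
               .trim()
               .parse::<i64>()
               .map(ToyValue::Int64)
               .map_err(|e| format!("cannot cast {s:?} to Int64: {e}")),
       }
   }
   
   /// Same signature as the kernel, so call sites can be switched mechanically.
   /// Special-case handling (e.g. datetime tweaks) could later live here.
   fn tracked_cast(value: &ToyValue, to: ToyType) -> Result<ToyValue, String> {
       CAST_CALLS.fetch_add(1, Ordering::Relaxed);
       kernel_cast(value, to)
   }
   
   /// Number of casts that have gone through `tracked_cast()` so far.
   fn cast_call_count() -> usize {
       CAST_CALLS.load(Ordering::Relaxed)
   }
   ```
   
   Because the wrapper keeps the kernel's signature, migrating a call site is a rename, and any later consolidation happens in one place.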
   
   A good next step might be to create a drop-in replacement that operates on 
`Field`s or `FieldRef`s (we kind of have this in datafusion-common already to 
handle special rules around casting structs).
   
   After these are consolidated and we have a way to look up registered casts 
for extension types, we can perhaps add a variant that pipes through a 
`ConfigOptions` or session reference (which would let those internal casts 
behave identically to cast expressions).
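   
   A minimal, std-only sketch of what "look up registered casts" could mean is below. `ToyCastRegistry`, `CastFn`, and `uuid_hex_to_utf8()` are hypothetical names invented for illustration, values are plain strings rather than Arrow arrays, and nothing here reflects an existing DataFusion API. In the real design the registry would presumably be reached through `ConfigOptions` or the session.
   
   ```rust
   use std::collections::HashMap;
   
   /// Hypothetical cast function: takes a value and returns the cast value.
   type CastFn = fn(&str) -> Result<String, String>;
   
   /// Hypothetical registry keyed by (source type name, target type name).
   #[derive(Default)]
   struct ToyCastRegistry {
       casts: HashMap<(String, String), CastFn>,
   }
   
   impl ToyCastRegistry {
       fn register(&mut self, from: &str, to: &str, cast: CastFn) {
           self.casts.insert((from.to_string(), to.to_string()), cast);
       }
   
       /// Look up a registered cast and apply it; error if none is registered.
       fn cast(&self, value: &str, from: &str, to: &str) -> Result<String, String> {
           match self.casts.get(&(from.to_string(), to.to_string())) {
               Some(cast) => cast(value),
               None => Err(format!("no cast registered from {from} to {to}")),
           }
       }
   }
   
   /// Example extension cast: 32 hex digits to the dashed UUID text form.
   fn uuid_hex_to_utf8(hex: &str) -> Result<String, String> {
       if hex.len() != 32 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
           return Err(format!("expected 32 hex digits, got {hex:?}"));
       }
       Ok(format!(
           "{}-{}-{}-{}-{}",
           &hex[0..8],
           &hex[8..12],
           &hex[12..16],
           &hex[16..20],
           &hex[20..32]
       ))
   }
   ```
   
   An internal cast that receives such a registry can resolve `("arrow.uuid", "utf8")` the same way a planned cast expression would, which is the behavior parity described above.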
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

