alamb commented on issue #11513:
URL: https://github.com/apache/datafusion/issues/11513#issuecomment-2239617843
Thoughts on the technical details
> This would also mean the introduction of new Field-compatible structure
(LogicalPhysicalField)
Since `LogicalPlans` already use DFSchema, maybe we could avoid introducing
a `LogialSchema` at all and instead update `DFSchema` to know how to report
`LogicalType` 🤔
> Open question: Should ScalarValue split into a LogicalScalarValue and a
PhysicalScalarValue? Or should it just provide generic information during
logical planning (e.g. its logical type, is_null(), is_zero()) without access
the underlying physical representation?
Another possibility wuld be to migate `ScalarValue` so it was entirely
logical (the usecase for a Physical scalar value is not large). We could
slowly remove the non-logical versions of ScalarValue (e.g. remove
`ScalarValue::LargeString` and `ScalarValue::Dictionary`) -- this would give us
some idea of where the boundary would be as well as I think improve the
usability (we have for example, hit issues where `ScalarValue::StringView` were
treated differently than `ScalarValue::Utf8`)
I think the biggest challenge of this project will be managing the
transition from Arrow DataType to LogicalTypes. You have done a great job
articulating the places that would need to be changed. I think we could manage
the transition over time by for example deprecating (but leaving UDF APIs in
terms of DataType)
> ColumnarValue enum could be extended so that functions could choose to
provide their own optimised implementation for a subset of physical types and
then fall back to a generic implementation that materialises the argument to
known physical type.
ANother possibility would be to make a function like
`ColumnarValue::into_one_of` that ensured (via casting if necessary) the input
was one of the types supported by the operator. For example
```rust
let input: ColumnarValue = &args[0];
// get input as one of the named types, casting if necessary
let input = input.into_one_of(&[DataType::Utf8View, DataType::Utf8])?;
match input.data_type() {
DataType::Utf8View => { /*specialized impl for StringViewArray */ },
DataType::Utf8 => { /*specialized impl for StringArray */ },
_ => unreachable!()
}
```
> RecordBatches with same logical type but different physical types
Another thing we could do is relax the requirement that the `DataType`s
matched exactly between physical plans, and instead just require that the
LogicalTypes matches (in other words an ExecutionPlan could report it generates
output like `Utf8` but could actually produce output like `Utf8View` 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]