alamb commented on issue #11513: URL: https://github.com/apache/datafusion/issues/11513#issuecomment-2239617843
Thoughts on the technical details > This would also mean the introduction of new Field-compatible structure (LogicalPhysicalField) Since `LogicalPlans` already use DFSchema, maybe we could avoid introducing a `LogialSchema` at all and instead update `DFSchema` to know how to report `LogicalType` 🤔 > Open question: Should ScalarValue split into a LogicalScalarValue and a PhysicalScalarValue? Or should it just provide generic information during logical planning (e.g. its logical type, is_null(), is_zero()) without access the underlying physical representation? Another possibility wuld be to migate `ScalarValue` so it was entirely logical (the usecase for a Physical scalar value is not large). We could slowly remove the non-logical versions of ScalarValue (e.g. remove `ScalarValue::LargeString` and `ScalarValue::Dictionary`) -- this would give us some idea of where the boundary would be as well as I think improve the usability (we have for example, hit issues where `ScalarValue::StringView` were treated differently than `ScalarValue::Utf8`) I think the biggest challenge of this project will be managing the transition from Arrow DataType to LogicalTypes. You have done a great job articulating the places that would need to be changed. I think we could manage the transition over time by for example deprecating (but leaving UDF APIs in terms of DataType) > ColumnarValue enum could be extended so that functions could choose to provide their own optimised implementation for a subset of physical types and then fall back to a generic implementation that materialises the argument to known physical type. ANother possibility would be to make a function like `ColumnarValue::into_one_of` that ensured (via casting if necessary) the input was one of the types supported by the operator. For example ```rust let input: ColumnarValue = &args[0]; // get input as one of the named types, casting if necessary let input = input.into_one_of(&[DataType::Utf8View, DataType::Utf8])?; match input.data_type() { DataType::Utf8View => { /*specialized impl for StringViewArray */ }, DataType::Utf8 => { /*specialized impl for StringArray */ }, _ => unreachable!() } ``` > RecordBatches with same logical type but different physical types Another thing we could do is relax the requirement that the `DataType`s matched exactly between physical plans, and instead just require that the LogicalTypes matches (in other words an ExecutionPlan could report it generates output like `Utf8` but could actually produce output like `Utf8View` 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org