notfilippo opened a new pull request, #11978: URL: https://github.com/apache/datafusion/pull/11978
This PR removes LargeUtf8|Binary, Utf8|BinaryView, and Dictionary from ScalarValue, following the discussion on #11513 --- ## Open questions ### Top level ScalarValue cast This change currently fails this test (and potentially others) and before proceeding I would like to get some feedback on how to solve this issue. The test fails because of an interaction of `expression_simplifier` + `optimize_projections`. The query is: `SELECT arrow_cast(1234, 'Float64') as f64, arrow_cast('foo', 'LargeUtf8') as large` 1. expression_simplifier correctly tries to evaluate_to_scalar the `arrow_cast('foo', 'LargeUtf8')` cast ([here](https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L626-L626)) 2. this leads to the cast physical expression ([here](https://github.com/apache/datafusion/blob/main/datafusion/expr-common/src/columnar_value.rs#L187-L187)), but this is a no-op since it transforms `Utf8('foo')` (scalar) → `LargeUtf8('foo')` (array) → `Utf8('foo')` (scalar) - TODO: if this is what we want this should probably not execute at all 3. This edit doesn't change the underlying schema of the LogicalPlan 4. `optimize_projections` rewrites the Projection and updates the schema, seeing the `Utf8('foo')` it correctly assumes that the LogicalPlan's schema field for this value should have DataType == Utf8 [This check](https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/optimizer.rs#L471-L471) is the one raising this error but I guess it should instead check if schema fields are _logically_ equivalent to eachother. I'm not totally convinced this is the correct solution because it removes some guarantees that might be expected by users downstream. Happy to hear everyone's opinion on this. ``` ---- optimizer::select_arrow_cast stdout ---- thread 'optimizer::select_arrow_cast' panicked at datafusion/core/tests/optimizer/mod.rs:118:30: called `Result::unwrap()` on an `Err` value: Context("Optimizer rule 'optimize_projections' failed", Context("optimize_projections", Internal("Failed due to a difference in schemas, original schema: DFSchema { inner: Schema { fields: [Field { name: \"f64\", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"large\", data_type: LargeUtf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, None], functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: \"f64\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"large\", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, None], functional_dependencies: FunctionalDependencies { deps: [] } }"))) note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org