alamb commented on issue #2581: URL: https://github.com/apache/datafusion/issues/2581#issuecomment-2571424336
> There are also real optimizations available here. For example, suppose I write an Arrow int8 column to Parquet. The Arrow schema is serialized into Parquet metadata so at read time the column is read back as int8. If a scalar expression tries to sum this column with an i32, e.g. `SELECT col + 10i32`, then DataFusion inserts an upcast. Today, this results in decoding the Parquet column (whose smallest physical integer type is int32) into an Arrow int32 array, then [casting to an int8](https://github.com/apache/arrow-rs/blob/b77d38d022079b106449ead3996f373edc906327/parquet/src/arrow/array_reader/primitive_array.rs#L273), then DataFusion casting back to an int32. I tried to reproduce the issue you described and I could not Specifically, I think in this case DataFusion actully casts the `10` to `Int8` and evaluate that directly against the contents of the column. Here is what I tried: ```sql DataFusion CLI v44.0.0 > copy (select arrow_cast(1, 'Int8') as x) to '/tmp/foo.parquet'; +-------+ | count | +-------+ | 1 | +-------+ 1 row(s) fetched. Elapsed 0.062 seconds. > describe '/tmp/foo.parquet'; +-------------+-----------+-------------+ | column_name | data_type | is_nullable | +-------------+-----------+-------------+ | x | Int8 | NO | +-------------+-----------+-------------+ 1 row(s) fetched. Elapsed 0.012 seconds. > explain select x = arrow_cast(10, 'Int32') from '/tmp/foo.parquet'; +---------------+-------------------------------------------------------------------------------------------------------+ | plan_type | plan | +---------------+-------------------------------------------------------------------------------------------------------+ | logical_plan | Projection: /tmp/foo.parquet.x = Int8(10) AS /tmp/foo.parquet.x = arrow_cast(Int64(10),Utf8("Int32")) | | | TableScan: /tmp/foo.parquet projection=[x] | | physical_plan | ProjectionExec: expr=[x@0 = 10 as /tmp/foo.parquet.x = arrow_cast(Int64(10),Utf8("Int32"))] | | | ParquetExec: file_groups={1 group: [[tmp/foo.parquet]]}, projection=[x] | | | | +---------------+-------------------------------------------------------------------------------------------------------+ 2 row(s) fetched. Elapsed 0.017 seconds. ``` Specifically this line: > | physical_plan | ProjectionExec: expr=[x@0 = 10 as /tmp/foo.parquet.x = arrow_cast(Int64(10),Utf8("Int32"))] | the `x@0 = 10` means the `10` was cast to `Int8` to match the column type, not the other way around -- this is done by the UnwrapToCast AnalyzerPass There may be more complex examples (e.g. with nested types) where the constant cast can't be unwrapped -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org