adriangb opened a new issue, #15780: URL: https://github.com/apache/datafusion/issues/15780
### Describe the bug Consider the following test: ```sql COPY ( SELECT arrow_cast(a, 'Int16') AS a FROM ( VALUES (1), (2), (3) ) AS t(a) ) TO 'test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet' STORED AS PARQUET; set datafusion.execution.parquet.pushdown_filters = true; CREATE EXTERNAL TABLE t_pushdown(a int) STORED AS PARQUET LOCATION 'test_files/scratch/parquet_filter_pushdown/parquet_table/'; select * from t_pushdown where a = arrow_cast(2, 'Int8'); ``` At some point `DataFusion` optimizes the `Int8` filter by casting the _filter_ to `Int32` (matching the table schema, thus avoiding having to cast the column). So when the filter gets into `ParquetSource` it's an `Int32` filter. But when we read the file schema it's actually an `Int8`! Since we now build pruning predicates, etc. on a per-file basis using the physical file schema this can introduce casting of the data from `Int8` to `Int32` which is unnecessary because (1) we could cast the filter instead which would be much cheaper and (2) if the file type and filter type were both `Int8` or `Int16` in this example (as might happen if one changes the table schema but not old data or old queries) we would actually be _closer_ to the original intent of the query. This applies not only to stats filtering (where the impact is likely negligible) but also to _predicate pushdown_ where I expect the impact may be much larger especially for cases where we never end up materializing the columns (and thus don't have to cast them to the table's data type at all). I don't know that any benchmark measures this case at the moment though. To resolve this I think we just need to call `optimize_casts(physical_expr, physical_file_schema)` (a made up function) but I don't know where or how`optimize_casts` exists (I feel like it must already exist, maybe it's at the logical expr level?). ### To Reproduce _No response_ ### Expected behavior _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org