adriangb opened a new issue, #15780:
URL: https://github.com/apache/datafusion/issues/15780
### Describe the bug
Consider the following test:
```sql
COPY (
SELECT arrow_cast(a, 'Int16') AS a
FROM ( VALUES (1), (2), (3) ) AS t(a)
) TO 'test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet'
STORED AS PARQUET;
set datafusion.execution.parquet.pushdown_filters = true;
CREATE EXTERNAL TABLE t_pushdown(a int) STORED AS PARQUET
LOCATION 'test_files/scratch/parquet_filter_pushdown/parquet_table/';
select * from t_pushdown where a = arrow_cast(2, 'Int8');
```
At some point `DataFusion` optimizes the `Int8` filter by casting the
_filter_ to `Int32` (matching the table schema, thus avoiding having to cast
the column).
So when the filter gets into `ParquetSource` it's an `Int32` filter. But
when we read the file schema it's actually an `Int8`! Since we now build
pruning predicates, etc. on a per-file basis using the physical file schema
this can introduce casting of the data from `Int8` to `Int32` which is
unnecessary because (1) we could cast the filter instead which would be much
cheaper and (2) if the file type and filter type were both `Int8` or `Int16` in
this example (as might happen if one changes the table schema but not old data
or old queries) we would actually be _closer_ to the original intent of the
query.
This applies not only to stats filtering (where the impact is likely
negligible) but also to _predicate pushdown_ where I expect the impact may be
much larger especially for cases where we never end up materializing the
columns (and thus don't have to cast them to the table's data type at all). I
don't know that any benchmark measures this case at the moment though.
To resolve this I think we just need to call `optimize_casts(physical_expr,
physical_file_schema)` (a made up function) but I don't know where or
how`optimize_casts` exists (I feel like it must already exist, maybe it's at
the logical expr level?).
### To Reproduce
_No response_
### Expected behavior
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]