adriangb opened a new issue, #15780:
URL: https://github.com/apache/datafusion/issues/15780

   ### Describe the bug
   
   Consider the following test:
   
   ```sql
   COPY  (
     SELECT arrow_cast(a, 'Int16') AS a
     FROM ( VALUES (1), (2), (3) ) AS t(a)
   )  TO 'test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet'
   STORED AS PARQUET;
   
   set datafusion.execution.parquet.pushdown_filters = true;
   
   CREATE EXTERNAL TABLE t_pushdown(a int) STORED AS PARQUET
   LOCATION 'test_files/scratch/parquet_filter_pushdown/parquet_table/';
   
   select * from t_pushdown where a = arrow_cast(2, 'Int8');
   ```
   
   At some point `DataFusion` optimizes the `Int8` filter by casting the 
_filter_ to `Int32` (matching the table schema, thus avoiding having to cast 
the column).
   
   So when the filter gets into `ParquetSource` it's an `Int32` filter. But 
when we read the file schema it's actually an `Int8`! Since we now build 
pruning predicates, etc. on a per-file basis using the physical file schema 
this can introduce casting of the data from `Int8` to `Int32` which is 
unnecessary because (1) we could cast the filter instead which would be much 
cheaper and (2) if the file type and filter type were both `Int8` or `Int16` in 
this example (as might happen if one changes the table schema but not old data 
or old queries) we would actually be _closer_ to the original intent of the 
query.
   
   This applies not only to stats filtering (where the impact is likely 
negligible) but also to _predicate pushdown_ where I expect the impact may be 
much larger especially for cases where we never end up materializing the 
columns (and thus don't have to cast them to the table's data type at all). I 
don't know that any benchmark measures this case at the moment though.
   
   To resolve this I think we just need to call `optimize_casts(physical_expr, 
physical_file_schema)` (a made up function) but I don't know where or 
how`optimize_casts` exists (I feel like it must already exist, maybe it's at 
the logical expr level?).
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to