alamb opened a new issue, #15035:
URL: https://github.com/apache/datafusion/issues/15035

   ### Is your feature request related to a problem or challenge?
   
   - related to https://github.com/apache/datafusion/issues/14944
   
   The usecase is a predicate like this, where month_id is an integer:
   
   ```sql
   month_id = '202502'
   ```
   
   In DataFusion this results in casting the `month_id` column to a `Utf8` and 
then is compared to the string `'202502'`
   
   This is non ideal for at least 3 reasons:
   1. Converting many rows to strings is expensive
   2. Comparing strings is much slower than comparing integers
   3. many data sources (like parquet) only handle predicates in the form of 
`<col> <op> <const>` and not `cast(<col>) <op> <const>` so these predicates can 
be pushed down
   
   Here is an example (note the predicate in the plan below is 
`CAST(foo.month_id AS Utf8) = Utf8("2024") ` rather than `foo.month_id  = 
Int32("2024")`):
   
   ```sql
   DataFusion CLI v46.0.0
   > create table foo(month_id int) as values (1), (2), (3);
   0 row(s) fetched.
   Elapsed 0.003 seconds.
   
   > explain select * from foo where month_id = '2024';
   +---------------+-------------------------------------------------------+
   | plan_type     | plan                                                  |
   +---------------+-------------------------------------------------------+
   | logical_plan  | Filter: CAST(foo.month_id AS Utf8) = Utf8("2024")     |
   |               |   TableScan: foo projection=[month_id]                |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192           |
   |               |   FilterExec: CAST(month_id@0 AS Utf8) = 2024         |
   |               |     DataSourceExec: partitions=1, partition_sizes=[1] |
   |               |                                                       |
   +---------------+-------------------------------------------------------+
   2 row(s) fetched.
   Elapsed 0.008 seconds.
   ```
   
   
   ### Describe the solution you'd like
   
   
   I would like the filter in the plan above to be `FilterExec: month_id = 
2024` (no cast on month_id)
   
   Other notes:
   1. I think this applies to `=` and `!=`
   2. I don't think it can be done for `<`, `<=`, `>=` and `>` as the semantics 
for comparing ints and strings is different than equality.
   
   
   ### Describe alternatives you've considered
   
   We already have something called `unwrap_in_cast` that does this 
transformation
   
   @jayzhan211  has moved this code in this PR
   - https://github.com/apache/datafusion/pull/15012
   
   I think once that PR merged, that would be the natural place to add this 
optimization
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to