nevi-me opened a new issue #1116:
URL: https://github.com/apache/arrow-datafusion/issues/1116


   **Describe the bug**
   
   If using a data source that supports exact filters, we can duplicate filters 
if not all filters can be pushed down to the source.
   This happens if we have multiple filters on the same column, but one or more 
of those filters cannot be pushed to source.
   
   **To Reproduce**
   
   Using https://github.com/TheDataEngine/datafusion-mongo-connector, and the 
below SQL query on the NYC dataset:
   
   ```sql
   select 
   count(*) as total_records,
   VendorID as vid,
   sum(cast(trip_distance as float)) as total_distance
   from mongo_nyc
   where 
       passenger_count > 3 and 
       cast(trip_distance as float) < 5.00 and
       fare_amount / (total_amount + 0.001) > 0.70 and
       total_amount < 20.0 and 
       passenger_count is not null and 
       -passenger_count < -2 and 
       VendorID in ('2', '4')
   group by VendorID
   order by vid
   limit 100
   ```
   
   I am able to pass down the following filters in the where clause:
   
   * Gt, Lt, Lte:`passenger_count > 3` and `total_amount < 20.0`
   * Cast: `cast(trip_distance as float) < 5.00`
   * Some arithmetic: `fare_amount / (total_amount + 0.001) > 0.70`
   * IsNull, IsNotNull: `passenger_count is not null`
   * InList: `VendorID in ('2', '4')`
   
   I don't yet support pushing the below:
   
   * Neg: `-passenger_count < -2`
   
   If the query includes the above unsupported filter, the other exact filters 
are duplicated.
   
   ```rust
   Limit: 100
     Sort: #vid ASC NULLS FIRST
       Projection: #mongo_nyc.VendorID AS vid
         Filter: #mongo_nyc.passenger_count > Int64(3) AND 
#mongo_nyc.passenger_count IS NOT NULL AND (- #mongo_nyc.passenger_count) < 
Int64(-2)
           TableScan: mongo_nyc projection=Some([0, 3, 4, 10, 16]), 
filters=[#mongo_nyc.passenger_count > Int64(3), CAST(#mongo_nyc.trip_distance 
AS Float64) < Float64(5), #mongo_nyc.fare_amount / #mongo_nyc.total_amount + 
Float64(0.001) > Float64(0.7), #mongo_nyc.total_amount < Float64(20), 
#mongo_nyc.passenger_count IS NOT NULL, #mongo_nyc.VendorID IN ([Utf8("2"), 
Utf8("4")])]
   
   ```
   
   If the query excludes the above negative filter, all filters are pushed down 
to the source.
   
   ```rust
   Limit: 100
     Sort: #vid ASC NULLS FIRST
       Projection: #mongo_nyc.VendorID AS vid
         TableScan: mongo_nyc projection=Some([0, 3, 4, 10, 16]), 
filters=[#mongo_nyc.passenger_count > Int64(3), CAST(#mongo_nyc.trip_distance 
AS Float64) < Float64(5), #mongo_nyc.fare_amount / #mongo_nyc.total_amount + 
Float64(0.001) > Float64(0.7), #mongo_nyc.total_amount < Float64(20), 
#mongo_nyc.passenger_count IS NOT NULL, #mongo_nyc.VendorID IN ([Utf8("2"), 
Utf8("4")])]
   
   ```
   
   The `passenger_count` filters that are pushed to the source, are also 
evaluated by datafusion.
   
   **Expected behavior**
   
   Given that the filters are `AND`, I would expect datafusion to only evaluate 
the negated condition, as the other conditions (not null, > 3) would be 
redundant.
   
   **Additional context**
   
   I'm aware that constant folding will simplify `passenger_count > 3 and 
-passenger_count < -2` to:
   
   * `pc > 3 and pc >= 2`
   * `pc >= 2`
   
   but before then,  we are performing a few redundant calculations because of 
the duplicated filters.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to