nevi-me opened a new issue #1116: URL: https://github.com/apache/arrow-datafusion/issues/1116
**Describe the bug** If using a data source that supports exact filters, we can duplicate filters if not all filters can be pushed down to the source. This happens if we have multiple filters on the same column, but one or more of those filters cannot be pushed to source. **To Reproduce** Using https://github.com/TheDataEngine/datafusion-mongo-connector, and the below SQL query on the NYC dataset: ```sql select count(*) as total_records, VendorID as vid, sum(cast(trip_distance as float)) as total_distance from mongo_nyc where passenger_count > 3 and cast(trip_distance as float) < 5.00 and fare_amount / (total_amount + 0.001) > 0.70 and total_amount < 20.0 and passenger_count is not null and -passenger_count < -2 and VendorID in ('2', '4') group by VendorID order by vid limit 100 ``` I am able to pass down the following filters in the where clause: * Gt, Lt, Lte:`passenger_count > 3` and `total_amount < 20.0` * Cast: `cast(trip_distance as float) < 5.00` * Some arithmetic: `fare_amount / (total_amount + 0.001) > 0.70` * IsNull, IsNotNull: `passenger_count is not null` * InList: `VendorID in ('2', '4')` I don't yet support pushing the below: * Neg: `-passenger_count < -2` If the query includes the above unsupported filter, the other exact filters are duplicated. ```rust Limit: 100 Sort: #vid ASC NULLS FIRST Projection: #mongo_nyc.VendorID AS vid Filter: #mongo_nyc.passenger_count > Int64(3) AND #mongo_nyc.passenger_count IS NOT NULL AND (- #mongo_nyc.passenger_count) < Int64(-2) TableScan: mongo_nyc projection=Some([0, 3, 4, 10, 16]), filters=[#mongo_nyc.passenger_count > Int64(3), CAST(#mongo_nyc.trip_distance AS Float64) < Float64(5), #mongo_nyc.fare_amount / #mongo_nyc.total_amount + Float64(0.001) > Float64(0.7), #mongo_nyc.total_amount < Float64(20), #mongo_nyc.passenger_count IS NOT NULL, #mongo_nyc.VendorID IN ([Utf8("2"), Utf8("4")])] ``` If the query excludes the above negative filter, all filters are pushed down to the source. ```rust Limit: 100 Sort: #vid ASC NULLS FIRST Projection: #mongo_nyc.VendorID AS vid TableScan: mongo_nyc projection=Some([0, 3, 4, 10, 16]), filters=[#mongo_nyc.passenger_count > Int64(3), CAST(#mongo_nyc.trip_distance AS Float64) < Float64(5), #mongo_nyc.fare_amount / #mongo_nyc.total_amount + Float64(0.001) > Float64(0.7), #mongo_nyc.total_amount < Float64(20), #mongo_nyc.passenger_count IS NOT NULL, #mongo_nyc.VendorID IN ([Utf8("2"), Utf8("4")])] ``` The `passenger_count` filters that are pushed to the source, are also evaluated by datafusion. **Expected behavior** Given that the filters are `AND`, I would expect datafusion to only evaluate the negated condition, as the other conditions (not null, > 3) would be redundant. **Additional context** I'm aware that constant folding will simplify `passenger_count > 3 and -passenger_count < -2` to: * `pc > 3 and pc >= 2` * `pc >= 2` but before then, we are performing a few redundant calculations because of the duplicated filters. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org