nevi-me opened a new issue #557:
URL: https://github.com/apache/arrow-datafusion/issues/557
**Describe the bug**
If I run a query that includes a `union all` between two or more different
sources, it seems that the filters don't get passed to scans that can take
filters.
I noticed this while running a query that pulls data from CSV files and a
MongoDB database.
**To Reproduce**
I haven't tried this with Parquet + CSV, but I presume it's reproducible. To
reproduce:
* Create two datasets with the same schema, but different datafusion
supported formats. One of the formats should support filter pushdown.
* Run a query that has a `union all` and a filter outside of the union.
An example query with the New York City Yellow Cabs dataset:
```sql
select
count(*) as total_records,
count(distinct payment_type) as total_payment_types,
sum(cast(trip_distance as float)) as total_distance
from (select * from mongo_nyc union all select * from csv_nyc)
where
passenger_count > 3 and
total_amount < 20.0
```
Produces the following output:
```rust
Projection: #COUNT(UInt8(1)) AS total_records, #COUNT(DISTINCT payment_type)
AS total_payment_types, #SUM(CAST(trip_distance AS Float64)) AS total_distance
Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)), COUNT(DISTINCT
#payment_type), SUM(CAST(#trip_distance AS Float64))]]
Filter: #passenger_count Gt Int64(3) And #total_amount Lt Float64(20)
Union
Projection: #passenger_count, #trip_distance, #payment_type,
#total_amount
TableScan: mongo_nyc projection=Some([3, 4, 9, 16])
Projection: #passenger_count, #trip_distance, #payment_type,
#total_amount
TableScan: csv_nyc projection=Some([3, 4, 9, 16])
```
**Expected behavior**
I expected the filters on `passenger_count` and `total_amount` to be passed
to the `mongo_nyc` table scan. Note that `mongo_nyc` above supports exact
filter pushdown, so perhaps the `csv_nyc` should get rewritten to:
```diff
Projection: #COUNT(UInt8(1)) AS total_records, #COUNT(DISTINCT
payment_type) AS total_payment_types,
#SUM(CAST(trip_distance AS Float64)) AS total_distance
Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1)), COUNT(DISTINCT
#payment_type), SUM(CAST(#trip_distance AS Float64))]]
- Filter: #passenger_count Gt Int64(3) And #total_amount Lt Float64(20)
Union
Projection: #passenger_count, #trip_distance, #payment_type,
#total_amount
- TableScan: mongo_nyc projection=Some([3, 4, 9, 16])
+ TableScan: mongo_nyc projection=Some([3, 4, 9, 16]),
filters=[#passenger_count Gt Int64(3), #total_amount Lt Float64(20)]
Projection: #passenger_count, #trip_distance, #payment_type,
#total_amount
+ Filter: #passenger_count Gt Int64(3) And #total_amount Lt
Float64(20)
TableScan: csv_nyc projection=Some([3, 4, 9, 16])
```
**Additional context**
I tested this on master as at 13 June 2021
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]