nuno-faria commented on PR #17197:
URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3191090428

   Thanks @adriangb for looking into it.
   
   I found some issues with the fix.
   
   1. It appears to be causing a regression with the regular dynamic filter 
pushdown. Now more rows are being returned than what is supposed to. For 
example:
   ```sql
   copy (select i as k from generate_series(1, 10000000) as t(i)) to 
't1.parquet';
   copy (select i as k, i as v from generate_series(1, 10000000) as t(i)) to 
't2.parquet';
   create external table t1 stored as parquet location 't1.parquet';
   create external table t2 stored as parquet location 't2.parquet';
   
   explain analyze select *
   from t1
   join t2 on t1.k = t2.k
   where v = 1;
   
   -- in each run, the number of returned rows by t1 is different
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=6854272
       
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=8951424
   
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=7902848
   
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=8951424
   
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=7902848
   ```
   
   On main, the number of rows returned by t1 is the minimum expected:
   ```sql
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=20480
   
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=20480
       
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=20480
   
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=20480
       
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 1 ]
       output_rows=20480
   ```
   
   2. With partitioned joins, the dynamic filter is always `true`, while I 
think it should be the minimum/maximum of all partial filters. For example:
   ```sql
   explain analyze select *
   from t1
   join t2 on t1.k = t2.k
   where v = 1 or v = 2;
   
   -- all rows are returned
   DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ true ]
       output_rows=10000000
   ```
   
   I think in this case it should be `DynamicFilterPhysicalExpr [ k@0 >= 1 AND 
k@0 <= 2 ]`.
   
   However, if we had `v=1 or v=10000000` the filter in this case would be `[ 
k@0 >= 1 AND k@0 <= 10000000 ]`, which would also return all rows. Wouldn't it 
be feasible to build a filter by combining the different partition expressions 
with OR? Like so: `[ (k@0 >= 1 AND k@0 <= 1) OR (k@0 >= 10000000 AND k@0 <= 
10000000) ]`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to