nuno-faria commented on PR #17197: URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3191090428
Thanks @adriangb for looking into it. I found some issues with the fix. 1. It appears to be causing a regression with the regular dynamic filter pushdown. Now more rows are being returned than what is supposed to. For example: ```sql copy (select i as k from generate_series(1, 10000000) as t(i)) to 't1.parquet'; copy (select i as k, i as v from generate_series(1, 10000000) as t(i)) to 't2.parquet'; create external table t1 stored as parquet location 't1.parquet'; create external table t2 stored as parquet location 't2.parquet'; explain analyze select * from t1 join t2 on t1.k = t2.k where v = 1; -- in each run, the number of returned rows by t1 is different DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=6854272 DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=8951424 DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=7902848 DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=8951424 DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=7902848 ``` On main, the number of rows returned by t1 is the minimum expected: ```sql DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=20480 DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=20480 DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=20480 DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=20480 DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 1 ] output_rows=20480 ``` 2. With partitioned joins, the dynamic filter is always `true`, while I think it should be the minimum/maximum of all partial filters. For example: ```sql explain analyze select * from t1 join t2 on t1.k = t2.k where v = 1 or v = 2; -- all rows are returned DataSourceExec t1.parquet predicate=DynamicFilterPhysicalExpr [ true ] output_rows=10000000 ``` I think in this case it should be `DynamicFilterPhysicalExpr [ k@0 >= 1 AND k@0 <= 2 ]`. However, if we had `v=1 or v=10000000` the filter in this case would be `[ k@0 >= 1 AND k@0 <= 10000000 ]`, which would also return all rows. Wouldn't it be feasible to build a filter by combining the different partition expressions with OR? Like so: `[ (k@0 >= 1 AND k@0 <= 1) OR (k@0 >= 10000000 AND k@0 <= 10000000) ]`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org