alamb commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070806936
########## datafusion/sqllogictest/test_files/push_down_filter.slt: ########## @@ -218,43 +219,57 @@ LOCATION 'test_files/scratch/push_down_filter/t.parquet'; query TT explain select a from t where a = '100'; ---- -logical_plan TableScan: t projection=[a], full_filters=[t.a = Int32(100)] +logical_plan +01)Filter: t.a = Int32(100) +02)--TableScan: t projection=[a], partial_filters=[t.a = Int32(100)] # The predicate should not have a column cast when the value is a valid i32 query TT explain select a from t where a != '100'; ---- -logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)] +logical_plan +01)Filter: t.a != Int32(100) +02)--TableScan: t projection=[a], partial_filters=[t.a != Int32(100)] Review Comment: I recommend we change these tests to show the physical plans (not the logical plans) as that would more accurately show the pushdown happening. Maybe also something we could do as a separate PR ########## datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt: ########## @@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a; ---- logical_plan 01)Sort: t_pushdown.a ASC NULLS LAST -02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > Int32(2)] +02)--Projection: t_pushdown.a +03)----Filter: t_pushdown.b > Int32(2) +04)------TableScan: t_pushdown projection=[a, b], partial_filters=[t_pushdown.b > Int32(2)] physical_plan 01)SortPreservingMergeExec: [a@0 ASC NULLS LAST] 02)--SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true] -03)----DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a], file_type=parquet, predicate=b@1 > 2, pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2, required_guarantees=[] +03)----CoalesceBatchesExec: target_batch_size=8192 +04)------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2 +05)--------DataSourceExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]}, projection=[a], file_type=parquet, predicate=b@1 > 2 AND b@1 > 2 Review Comment: I agree it is important to remove the coalesce / repartition Maybe we can make a separate PR to move the filter pushdown code earlier in the physical planning An alternate could be to update the filter pushdown optimizer pass somehow to remove these -- but I think it would be cleaner / easier to understand if they were never added in the first place ########## datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt: ########## @@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a; ---- logical_plan 01)Sort: t_pushdown.a ASC NULLS LAST -02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > Int32(2)] +02)--Projection: t_pushdown.a +03)----Filter: t_pushdown.b > Int32(2) +04)------TableScan: t_pushdown projection=[a, b], partial_filters=[t_pushdown.b > Int32(2)] Review Comment: I think it makes sense and is ok that the logical plans show the filter not pushed down ########## datafusion/datasource-parquet/src/source.rs: ########## @@ -589,4 +559,49 @@ impl FileSource for ParquetSource { } } } + + fn try_pushdown_filters( + &self, + fd: FilterDescription, + config: &datafusion_common::config::ConfigOptions, + ) -> datafusion_common::Result<FilterPushdownResult<Arc<dyn FileSource>>> { + let Some(file_schema) = self.file_schema.clone() else { Review Comment: maybe we file a follow on ticket -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org