Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]

via GitHub Thu, 01 May 2025 14:06:07 -0700


alamb commented on code in PR #15769:
URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070806936



##########
datafusion/sqllogictest/test_files/push_down_filter.slt:
##########
@@ -218,43 +219,57 @@ LOCATION 'test_files/scratch/push_down_filter/t.parquet';
 query TT
 explain select a from t where a = '100';
 ----
-logical_plan TableScan: t projection=[a], full_filters=[t.a = Int32(100)]
+logical_plan
+01)Filter: t.a = Int32(100)
+02)--TableScan: t projection=[a], partial_filters=[t.a = Int32(100)]
 
 # The predicate should not have a column cast  when the value is a valid i32
 query TT
 explain select a from t where a != '100';
 ----
-logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)]
+logical_plan
+01)Filter: t.a != Int32(100)
+02)--TableScan: t projection=[a], partial_filters=[t.a != Int32(100)]

Review Comment:
   I recommend we change these tests to show the physical plans (not the 
logical plans) as that would more accurately show the pushdown happening. Maybe 
also something we could do as a separate PR



##########
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
##########
@@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a;
 ----
 logical_plan
 01)Sort: t_pushdown.a ASC NULLS LAST
-02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > 
Int32(2)]
+02)--Projection: t_pushdown.a
+03)----Filter: t_pushdown.b > Int32(2)
+04)------TableScan: t_pushdown projection=[a, b], 
partial_filters=[t_pushdown.b > Int32(2)]
 physical_plan
 01)SortPreservingMergeExec: [a@0 ASC NULLS LAST]
 02)--SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true]
-03)----DataSourceExec: file_groups={2 groups: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
 
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
 projection=[a], file_type=parquet, predicate=b@1 > 2, 
pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2, 
required_guarantees=[]
+03)----CoalesceBatchesExec: target_batch_size=8192
+04)------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
+05)--------DataSourceExec: file_groups={2 groups: 
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
 
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
 projection=[a], file_type=parquet, predicate=b@1 > 2 AND b@1 > 2

Review Comment:
   I agree it is important to remove the coalesce / repartition
   
   Maybe we can make a separate PR to move the filter pushdown code earlier in 
the physical planning
   
   An alternate could be to update the filter pushdown optimizer pass somehow 
to remove these -- but I think it would be cleaner / easier to understand if 
they were never added in the first place



##########
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
##########
@@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a;
 ----
 logical_plan
 01)Sort: t_pushdown.a ASC NULLS LAST
-02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b > 
Int32(2)]
+02)--Projection: t_pushdown.a
+03)----Filter: t_pushdown.b > Int32(2)
+04)------TableScan: t_pushdown projection=[a, b], 
partial_filters=[t_pushdown.b > Int32(2)]

Review Comment:
   I think it makes sense and is ok that the logical plans show the filter not 
pushed down



##########
datafusion/datasource-parquet/src/source.rs:
##########
@@ -589,4 +559,49 @@ impl FileSource for ParquetSource {
             }
         }
     }
+
+    fn try_pushdown_filters(
+        &self,
+        fd: FilterDescription,
+        config: &datafusion_common::config::ConfigOptions,
+    ) -> datafusion_common::Result<FilterPushdownResult<Arc<dyn FileSource>>> {
+        let Some(file_schema) = self.file_schema.clone() else {

Review Comment:
   maybe we file a follow on ticket



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]

Reply via email to