alamb commented on code in PR #15769:
URL: https://github.com/apache/datafusion/pull/15769#discussion_r2070806936
##########
datafusion/sqllogictest/test_files/push_down_filter.slt:
##########
@@ -218,43 +219,57 @@ LOCATION 'test_files/scratch/push_down_filter/t.parquet';
query TT
explain select a from t where a = '100';
----
-logical_plan TableScan: t projection=[a], full_filters=[t.a = Int32(100)]
+logical_plan
+01)Filter: t.a = Int32(100)
+02)--TableScan: t projection=[a], partial_filters=[t.a = Int32(100)]
# The predicate should not have a column cast when the value is a valid i32
query TT
explain select a from t where a != '100';
----
-logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)]
+logical_plan
+01)Filter: t.a != Int32(100)
+02)--TableScan: t projection=[a], partial_filters=[t.a != Int32(100)]
Review Comment:
I recommend we change these tests to show the physical plans (not the
logical plans) as that would more accurately show the pushdown happening. Maybe
also something we could do as a separate PR
##########
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
##########
@@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a;
----
logical_plan
01)Sort: t_pushdown.a ASC NULLS LAST
-02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b >
Int32(2)]
+02)--Projection: t_pushdown.a
+03)----Filter: t_pushdown.b > Int32(2)
+04)------TableScan: t_pushdown projection=[a, b],
partial_filters=[t_pushdown.b > Int32(2)]
physical_plan
01)SortPreservingMergeExec: [a@0 ASC NULLS LAST]
02)--SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true]
-03)----DataSourceExec: file_groups={2 groups:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
projection=[a], file_type=parquet, predicate=b@1 > 2,
pruning_predicate=b_null_count@1 != row_count@2 AND b_max@0 > 2,
required_guarantees=[]
+03)----CoalesceBatchesExec: target_batch_size=8192
+04)------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=2
+05)--------DataSourceExec: file_groups={2 groups:
[[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet],
[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet_filter_pushdown/parquet_table/2.parquet]]},
projection=[a], file_type=parquet, predicate=b@1 > 2 AND b@1 > 2
Review Comment:
I agree it is important to remove the coalesce / repartition
Maybe we can make a separate PR to move the filter pushdown code earlier in
the physical planning
An alternate could be to update the filter pushdown optimizer pass somehow
to remove these -- but I think it would be cleaner / easier to understand if
they were never added in the first place
##########
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
##########
@@ -81,11 +81,15 @@ EXPLAIN select a from t_pushdown where b > 2 ORDER BY a;
----
logical_plan
01)Sort: t_pushdown.a ASC NULLS LAST
-02)--TableScan: t_pushdown projection=[a], full_filters=[t_pushdown.b >
Int32(2)]
+02)--Projection: t_pushdown.a
+03)----Filter: t_pushdown.b > Int32(2)
+04)------TableScan: t_pushdown projection=[a, b],
partial_filters=[t_pushdown.b > Int32(2)]
Review Comment:
I think it makes sense and is ok that the logical plans show the filter not
pushed down
##########
datafusion/datasource-parquet/src/source.rs:
##########
@@ -589,4 +559,49 @@ impl FileSource for ParquetSource {
}
}
}
+
+ fn try_pushdown_filters(
+ &self,
+ fd: FilterDescription,
+ config: &datafusion_common::config::ConfigOptions,
+ ) -> datafusion_common::Result<FilterPushdownResult<Arc<dyn FileSource>>> {
+ let Some(file_schema) = self.file_schema.clone() else {
Review Comment:
maybe we file a follow on ticket
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]