Re: [PR] Fix predicate pushdown for custom SchemaAdapters [datafusion]

via GitHub Sun, 16 Mar 2025 18:52:54 -0700


adriangb commented on code in PR #15263:
URL: https://github.com/apache/datafusion/pull/15263#discussion_r1997816085



##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -336,82 +338,40 @@ impl<'schema> PushdownChecker<'schema> {
     }
 }
 
-impl TreeNodeRewriter for PushdownChecker<'_> {
+impl TreeNodeVisitor<'_> for PushdownChecker<'_> {
     type Node = Arc<dyn PhysicalExpr>;
 
-    fn f_down(
-        &mut self,
-        node: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
+    fn f_down(&mut self, node: &Self::Node) -> Result<TreeNodeRecursion> {
         if let Some(column) = node.as_any().downcast_ref::<Column>() {
             if let Some(recursion) = self.check_single_column(column.name()) {
-                return Ok(Transformed::new(node, false, recursion));
-            }
-        }
-
-        Ok(Transformed::no(node))
-    }
-
-    /// After visiting all children, rewrite column references to nulls if
-    /// they are not in the file schema.
-    /// We do this because they won't be relevant if they're not in the file 
schema, since that's
-    /// the only thing we're dealing with here as this is only used for the 
parquet pushdown during
-    /// scanning
-    fn f_up(
-        &mut self,
-        expr: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {

Review Comment:
   While this would be more efficient than the new system for the case of 
missing columns since it avoids generating a column of nulls altogether the 
point is that it's not correct since it assumes that the default SchemaAdapter 
is being used but that's a pluggable trait.
   
   I do wonder if there's something that should be done to optimize for the 
default case. Not just here, but even more importantly at the stats level: even 
more efficient than pruning here would be to inject stats of `null_count = 
row_count` at the row group stats level which would prune much earlier and 
cheaper. That would rely on the same assumption though. Maybe 
https://github.com/apache/datafusion/issues/15220 can introduce an API to ask a 
SchemaAdapter for optional stats on columns it may generate? If we know the 
stats ahead of time we can prune at the stats level. If we don't then we can't 
prune here anyway. So I don't see any cons to removing this in this PR unless 
we want to stick with the current buggy behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Fix predicate pushdown for custom SchemaAdapters [datafusion]

Reply via email to