adriangb commented on code in PR #15263: URL: https://github.com/apache/datafusion/pull/15263#discussion_r1997816085
########## datafusion/datasource-parquet/src/row_filter.rs: ########## @@ -336,82 +338,40 @@ impl<'schema> PushdownChecker<'schema> { } } -impl TreeNodeRewriter for PushdownChecker<'_> { +impl TreeNodeVisitor<'_> for PushdownChecker<'_> { type Node = Arc<dyn PhysicalExpr>; - fn f_down( - &mut self, - node: Arc<dyn PhysicalExpr>, - ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> { + fn f_down(&mut self, node: &Self::Node) -> Result<TreeNodeRecursion> { if let Some(column) = node.as_any().downcast_ref::<Column>() { if let Some(recursion) = self.check_single_column(column.name()) { - return Ok(Transformed::new(node, false, recursion)); - } - } - - Ok(Transformed::no(node)) - } - - /// After visiting all children, rewrite column references to nulls if - /// they are not in the file schema. - /// We do this because they won't be relevant if they're not in the file schema, since that's - /// the only thing we're dealing with here as this is only used for the parquet pushdown during - /// scanning - fn f_up( - &mut self, - expr: Arc<dyn PhysicalExpr>, - ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> { Review Comment: While this would be more efficient than the new system for the case of missing columns since it avoids generating a column of nulls altogether the point is that it's not correct since it assumes that the default SchemaAdapter is being used but that's a pluggable trait. I do wonder if there's something that should be done to optimize for the default case. Not just here, but even more importantly at the stats level: even more efficient than pruning here would be to inject stats of `null_count = row_count` at the row group stats level which would prune much earlier and cheaper. That would rely on the same assumption though. Maybe https://github.com/apache/datafusion/issues/15220 can introduce an API to ask a SchemaAdapter for optional stats on columns it may generate? If we know the stats ahead of time we can prune at the stats level. If we don't then we can't prune here anyway. So I don't see any cons to removing this in this PR unless we want to stick with the current buggy behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org