alamb commented on code in PR #15263:
URL: https://github.com/apache/datafusion/pull/15263#discussion_r1999523980
##########
datafusion/core/src/datasource/physical_plan/parquet.rs:
##########
@@ -224,6 +224,64 @@ mod tests {
         )
     }

+    #[tokio::test]
+    async fn test_pushdown_with_missing_column_in_file() {
+        let c1: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
+
+        let file_schema =
+            Arc::new(Schema::new(vec![Field::new("c1", DataType::Int32, true)]));
+
+        let table_schema = Arc::new(Schema::new(vec![
+            Field::new("c1", DataType::Int32, true),
+            Field::new("c2", DataType::Int32, true),

Review Comment:
   Can you please also add tests (or ensure they already exist) for:
   1. When the type of c1 is different than c2 (so like `Utf8` for example)
   2. When the missing column is in the middle of the schema (so like a table schema with `c1, c2, c3` and a file schema like `c1, c3` or something)
   3. When the file schema has columns out of order (so like a table schema with `c1, c2, c3` and a file schema like `c3, c1`)
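For the second scenario the reviewer lists (a column missing from the middle of the table schema), here is a minimal sketch of the schema and batch setup, reusing the same arrow builders as the test quoted above. The test name, the `c3` column, and the data values are illustrative only, and the existing test's parquet round-trip/scan harness (not shown here) would still be needed to actually exercise the filter pushdown:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

#[test]
fn test_pushdown_with_missing_middle_column() {
    // Table schema: c1, c2, c3 -- the file is missing the *middle* column c2.
    let table_schema = Arc::new(Schema::new(vec![
        Field::new("c1", DataType::Int32, true),
        Field::new("c2", DataType::Int32, true),
        Field::new("c3", DataType::Int32, true),
    ]));

    // File schema only contains c1 and c3.
    let file_schema = Arc::new(Schema::new(vec![
        Field::new("c1", DataType::Int32, true),
        Field::new("c3", DataType::Int32, true),
    ]));

    // Data as it would be written to the parquet file (file-schema columns only).
    let c1: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let c3: ArrayRef = Arc::new(Int32Array::from(vec![7, 8, 9]));
    let batch = RecordBatch::try_new(file_schema, vec![c1, c3]).unwrap();

    // The round-trip harness would write `batch` to parquet, scan it with
    // `table_schema` and a predicate such as `c2 = 1 AND c3 = 7`, and check
    // that filtering still works when the missing c2 is treated as all-null.
    let _ = (table_schema, batch);
}
```

The out-of-order case (item 3) would only change the order of the fields in `file_schema` relative to `table_schema`; the rest of the setup stays the same.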
##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -537,12 +464,20 @@ pub fn build_row_filter(
     // `a = 1 AND b = 2 AND c = 3` -> [`a = 1`, `b = 2`, `c = 3`]
     let predicates = split_conjunction(expr);
+
+    let file_schema = Arc::new(file_schema.clone());
+    let table_schema = Arc::new(table_schema.clone());

Review Comment:
   I think you can avoid cloning the schema with a pretty simple change. Here is a proposal:
   - https://github.com/pydantic/datafusion/pull/9

##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -336,82 +338,40 @@ impl<'schema> PushdownChecker<'schema> {
     }
 }

-impl TreeNodeRewriter for PushdownChecker<'_> {
+impl TreeNodeVisitor<'_> for PushdownChecker<'_> {
     type Node = Arc<dyn PhysicalExpr>;

-    fn f_down(
-        &mut self,
-        node: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
+    fn f_down(&mut self, node: &Self::Node) -> Result<TreeNodeRecursion> {
         if let Some(column) = node.as_any().downcast_ref::<Column>() {
             if let Some(recursion) = self.check_single_column(column.name()) {
-                return Ok(Transformed::new(node, false, recursion));
-            }
-        }
-
-        Ok(Transformed::no(node))
-    }
-
-    /// After visiting all children, rewrite column references to nulls if
-    /// they are not in the file schema.
-    /// We do this because they won't be relevant if they're not in the file schema, since that's
-    /// the only thing we're dealing with here as this is only used for the parquet pushdown during
-    /// scanning
-    fn f_up(
-        &mut self,
-        expr: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {

Review Comment:
   I do think in general we need to be correct first, then fast. As someone once told me: "if you don't constrain it (the compiler) to be correct, I'll make it as fast as you want!"

##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -336,82 +338,40 @@ impl<'schema> PushdownChecker<'schema> {
     }
 }

-impl TreeNodeRewriter for PushdownChecker<'_> {
+impl TreeNodeVisitor<'_> for PushdownChecker<'_> {
     type Node = Arc<dyn PhysicalExpr>;

-    fn f_down(
-        &mut self,
-        node: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
+    fn f_down(&mut self, node: &Self::Node) -> Result<TreeNodeRecursion> {
         if let Some(column) = node.as_any().downcast_ref::<Column>() {
             if let Some(recursion) = self.check_single_column(column.name()) {
-                return Ok(Transformed::new(node, false, recursion));
-            }
-        }
-
-        Ok(Transformed::no(node))
-    }
-
-    /// After visiting all children, rewrite column references to nulls if
-    /// they are not in the file schema.
-    /// We do this because they won't be relevant if they're not in the file schema, since that's
-    /// the only thing we're dealing with here as this is only used for the parquet pushdown during
-    /// scanning
-    fn f_up(
-        &mut self,
-        expr: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {

Review Comment:
   I agree adding an API for stats on the new column would be 💯
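As background on the `TreeNodeVisitor` API that the quoted diff switches `PushdownChecker` to, here is a minimal, self-contained sketch (not the PR's code) that walks a physical expression and records which columns it references. The `ColumnCollector` struct, the example expression, and the schema are illustrative, and the imports assume the umbrella `datafusion` crate's re-exports:

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::tree_node::{TreeNode, TreeNodeRecursion, TreeNodeVisitor};
use datafusion::common::Result;
use datafusion::logical_expr::Operator;
use datafusion::physical_expr::expressions::{col, lit, BinaryExpr, Column};
use datafusion::physical_plan::PhysicalExpr;

/// Records the name of every column referenced by a physical expression.
struct ColumnCollector {
    columns: Vec<String>,
}

impl TreeNodeVisitor<'_> for ColumnCollector {
    type Node = Arc<dyn PhysicalExpr>;

    fn f_down(&mut self, node: &Self::Node) -> Result<TreeNodeRecursion> {
        if let Some(column) = node.as_any().downcast_ref::<Column>() {
            self.columns.push(column.name().to_string());
        }
        // Keep walking into child expressions.
        Ok(TreeNodeRecursion::Continue)
    }
}

fn main() -> Result<()> {
    let schema = Schema::new(vec![
        Field::new("c1", DataType::Int32, true),
        Field::new("c2", DataType::Int32, true),
    ]);

    // Build `c1 = 1 AND c2 = 2` as a physical expression.
    let expr: Arc<dyn PhysicalExpr> = Arc::new(BinaryExpr::new(
        Arc::new(BinaryExpr::new(col("c1", &schema)?, Operator::Eq, lit(1i32))),
        Operator::And,
        Arc::new(BinaryExpr::new(col("c2", &schema)?, Operator::Eq, lit(2i32))),
    ));

    // Walk the expression tree without rewriting it.
    let mut collector = ColumnCollector { columns: vec![] };
    expr.visit(&mut collector)?;

    assert_eq!(collector.columns, vec!["c1".to_string(), "c2".to_string()]);
    Ok(())
}
```

Unlike the old `TreeNodeRewriter`, the visitor only observes nodes rather than producing transformed copies, which is why the removed `f_up` rewrite logic no longer lives in `PushdownChecker` itself.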