alamb commented on code in PR #15263:
URL: https://github.com/apache/datafusion/pull/15263#discussion_r1999523980
##########
datafusion/core/src/datasource/physical_plan/parquet.rs:
##########
@@ -224,6 +224,64 @@ mod tests {
         )
     }

+    #[tokio::test]
+    async fn test_pushdown_with_missing_column_in_file() {
+        let c1: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
+
+        let file_schema =
+            Arc::new(Schema::new(vec![Field::new("c1", DataType::Int32, true)]));
+
+        let table_schema = Arc::new(Schema::new(vec![
+            Field::new("c1", DataType::Int32, true),
+            Field::new("c2", DataType::Int32, true),

Review Comment:
   Can you please also add tests (or ensure they already exist) for:
   1. When the type of c1 is different than c2 (so like `Utf8` for example)
   2. When the missing column is in the middle of the schema (so like a table schema with `c1, c2, c3` and a file schema like `c1, c3` or something)
   3. When the file schema has columns out of order (so like a table schema with `c1, c2, c3` and a file schema like `c3, c1`)
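For the second scenario the reviewer lists (a column missing from the middle of the table schema), here is a minimal sketch of the schema and batch setup, reusing the same arrow builders as the test quoted above. The test name, the `c3` column, and the data values are illustrative only, and the existing test's parquet round-trip/scan harness (not shown here) would still be needed to actually exercise the filter pushdown:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

#[test]
fn test_pushdown_with_missing_middle_column() {
    // Table schema: c1, c2, c3 -- the file is missing the *middle* column c2.
    let table_schema = Arc::new(Schema::new(vec![
        Field::new("c1", DataType::Int32, true),
        Field::new("c2", DataType::Int32, true),
        Field::new("c3", DataType::Int32, true),
    ]));

    // File schema only contains c1 and c3.
    let file_schema = Arc::new(Schema::new(vec![
        Field::new("c1", DataType::Int32, true),
        Field::new("c3", DataType::Int32, true),
    ]));

    // Data as it would be written to the parquet file (file-schema columns only).
    let c1: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let c3: ArrayRef = Arc::new(Int32Array::from(vec![7, 8, 9]));
    let batch = RecordBatch::try_new(file_schema, vec![c1, c3]).unwrap();

    // The round-trip harness would write `batch` to parquet, scan it with
    // `table_schema` and a predicate such as `c2 = 1 AND c3 = 7`, and check
    // that filtering still works when the missing c2 is treated as all-null.
    let _ = (table_schema, batch);
}
```

The out-of-order case (item 3) would only change the order of the fields in `file_schema` relative to `table_schema`; the rest of the setup stays the same.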
##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -537,12 +464,20 @@ pub fn build_row_filter(
     // `a = 1 AND b = 2 AND c = 3` -> [`a = 1`, `b = 2`, `c = 3`]
     let predicates = split_conjunction(expr);
+
+    let file_schema = Arc::new(file_schema.clone());
+    let table_schema = Arc::new(table_schema.clone());

Review Comment:
   I think you can avoid cloning the schema with a pretty simple change. Here is a proposal:
   - https://github.com/pydantic/datafusion/pull/9

##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -336,82 +338,40 @@ impl<'schema> PushdownChecker<'schema> {
     }
 }

-impl TreeNodeRewriter for PushdownChecker<'_> {
+impl TreeNodeVisitor<'_> for PushdownChecker<'_> {
     type Node = Arc<dyn PhysicalExpr>;

-    fn f_down(
-        &mut self,
-        node: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
+    fn f_down(&mut self, node: &Self::Node) -> Result<TreeNodeRecursion> {
         if let Some(column) = node.as_any().downcast_ref::<Column>() {
             if let Some(recursion) = self.check_single_column(column.name()) {
-                return Ok(Transformed::new(node, false, recursion));
-            }
-        }
-
-        Ok(Transformed::no(node))
-    }
-
-    /// After visiting all children, rewrite column references to nulls if
-    /// they are not in the file schema.
-    /// We do this because they won't be relevant if they're not in the file schema, since that's
-    /// the only thing we're dealing with here as this is only used for the parquet pushdown during
-    /// scanning
-    fn f_up(
-        &mut self,
-        expr: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {

Review Comment:
   I do think in general we need to be correct first, then fast. As someone once told me: "if you don't constrain it (the compiler) to be correct, I'll make it as fast as you want!"

##########
datafusion/datasource-parquet/src/row_filter.rs:
##########
@@ -336,82 +338,40 @@ impl<'schema> PushdownChecker<'schema> {
     }
 }

-impl TreeNodeRewriter for PushdownChecker<'_> {
+impl TreeNodeVisitor<'_> for PushdownChecker<'_> {
     type Node = Arc<dyn PhysicalExpr>;

-    fn f_down(
-        &mut self,
-        node: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
+    fn f_down(&mut self, node: &Self::Node) -> Result<TreeNodeRecursion> {
         if let Some(column) = node.as_any().downcast_ref::<Column>() {
             if let Some(recursion) = self.check_single_column(column.name()) {
-                return Ok(Transformed::new(node, false, recursion));
-            }
-        }
-
-        Ok(Transformed::no(node))
-    }
-
-    /// After visiting all children, rewrite column references to nulls if
-    /// they are not in the file schema.
-    /// We do this because they won't be relevant if they're not in the file schema, since that's
-    /// the only thing we're dealing with here as this is only used for the parquet pushdown during
-    /// scanning
-    fn f_up(
-        &mut self,
-        expr: Arc<dyn PhysicalExpr>,
-    ) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {

Review Comment:
   I agree adding an API for stats on the new column would be 💯
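As background on the `TreeNodeVisitor` API that the quoted diff switches `PushdownChecker` to, here is a minimal, self-contained sketch (not the PR's code) that walks a physical expression and records which columns it references. The `ColumnCollector` struct, the example expression, and the schema are illustrative, and the imports assume the umbrella `datafusion` crate's re-exports:

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::tree_node::{TreeNode, TreeNodeRecursion, TreeNodeVisitor};
use datafusion::common::Result;
use datafusion::logical_expr::Operator;
use datafusion::physical_expr::expressions::{col, lit, BinaryExpr, Column};
use datafusion::physical_plan::PhysicalExpr;

/// Records the name of every column referenced by a physical expression.
struct ColumnCollector {
    columns: Vec<String>,
}

impl TreeNodeVisitor<'_> for ColumnCollector {
    type Node = Arc<dyn PhysicalExpr>;

    fn f_down(&mut self, node: &Self::Node) -> Result<TreeNodeRecursion> {
        if let Some(column) = node.as_any().downcast_ref::<Column>() {
            self.columns.push(column.name().to_string());
        }
        // Keep walking into child expressions.
        Ok(TreeNodeRecursion::Continue)
    }
}

fn main() -> Result<()> {
    let schema = Schema::new(vec![
        Field::new("c1", DataType::Int32, true),
        Field::new("c2", DataType::Int32, true),
    ]);

    // Build `c1 = 1 AND c2 = 2` as a physical expression.
    let expr: Arc<dyn PhysicalExpr> = Arc::new(BinaryExpr::new(
        Arc::new(BinaryExpr::new(col("c1", &schema)?, Operator::Eq, lit(1i32))),
        Operator::And,
        Arc::new(BinaryExpr::new(col("c2", &schema)?, Operator::Eq, lit(2i32))),
    ));

    // Walk the expression tree without rewriting it.
    let mut collector = ColumnCollector { columns: vec![] };
    expr.visit(&mut collector)?;

    assert_eq!(collector.columns, vec!["c1".to_string(), "c2".to_string()]);
    Ok(())
}
```

Unlike the old `TreeNodeRewriter`, the visitor only observes nodes rather than producing transformed copies, which is why the removed `f_up` rewrite logic no longer lives in `PushdownChecker` itself.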