Re: [PR] fix: simplify IS NULL/IS NOT NULL literals for parquet pruning [datafusion]

via GitHub Tue, 23 Dec 2025 10:31:35 -0800


pepijnve commented on code in PR #19434:
URL: https://github.com/apache/datafusion/pull/19434#discussion_r2643974414



##########
datafusion/pruning/src/pruning_predicate.rs:
##########
@@ -466,18 +466,12 @@ impl PruningPredicate {
         // In particular this unravels any `DynamicFilterPhysicalExpr`s by 
snapshotting them
         // so that PruningPredicate can work with a static expression.
         let tf = snapshot_physical_expr_opt(expr)?;
-        if tf.transformed {
-            // If we had an expression such as Dynamic(part_col < 5 and col < 
10)
-            // (this could come from something like `select * from t order by 
part_col, col, limit 10`)
-            // after snapshotting and because `DynamicFilterPhysicalExpr` 
applies child replacements to its
-            // children after snapshotting and previously 
`replace_columns_with_literals` may have been called with partition values
-            // the expression we have now is `8 < 5 and col < 10`.
-            // Thus we need as simplifier pass to get `false and col < 10` => 
`false` here.
-            let simplifier = PhysicalExprSimplifier::new(&schema);
-            expr = simplifier.simplify(tf.data)?;
-        } else {
-            expr = tf.data;
-        }
+        // Always simplify the expression to handle cases like IS NULL/IS NOT 
NULL on literals,
+        // constant expressions, and expressions that may have been 
transformed during snapshotting
+        // (e.g., DynamicFilterPhysicalExpr with partition value replacements 
like `8 < 5` => `false`).
+        // The simplifier is efficient and returns early if nothing can be 
simplified.
+        let simplifier = PhysicalExprSimplifier::new(&schema);
+        expr = simplifier.simplify(tf.data)?;

Review Comment:
   I was playing around with a simple microbenchmark that measured 
simplification of `(c1 < 100 or c1 > 200) and (c2 < 100 or c2 > 200)`. A lot of 
time was being spent trying to simplify the literals, which is a bit wasteful. 
This little addition resulted in about `-86%` on the microbenchmark. Overhead 
is then about `2.39 µs` on my laptop.
   
   @adriangb do you have some example of other typical 'complex' expression 
that would be useful to benchmark?
   
   ```
   --- a/datafusion/physical-expr/src/simplifier/mod.rs (revision 
2e160e07e993deee2ff6f93bce6d7dd5f4b7a198)
   +++ b/datafusion/physical-expr/src/simplifier/mod.rs (date 1766507550524)
   @@ -54,6 +54,10 @@
            while count < MAX_LOOP_COUNT {
                count += 1;
                let result = current_expr.transform(|node| {
   +                if node.as_any().downcast_ref::<Literal>().is_some() {
   +                    return Ok(Transformed::no(node));
   +                }
   +
                    #[cfg(test)]
                    let original_type = node.data_type(schema).unwrap();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: simplify IS NULL/IS NOT NULL literals for parquet pruning [datafusion]

Reply via email to