alamb commented on code in PR #14297:
URL: https://github.com/apache/datafusion/pull/14297#discussion_r1938285557


##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -287,11 +285,6 @@ pub trait PruningStatistics {
 ///   predicate can never possibly be true). The container can be pruned 
(skipped)
 ///   entirely.
 ///
-/// While `PruningPredicate` will never return a `NULL` value, the

Review Comment:
   This is a nice simplification for sure 👍 



##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -311,12 +304,6 @@ pub trait PruningStatistics {
 ///
 /// * `true`: there MAY be rows that pass the predicate, **KEEPS** the 
container
 ///
-/// * `NULL`: there MAY be rows that pass the predicate, **KEEPS** the 
container
-///           Note that rewritten predicate can evaluate to NULL when some of
-///           the min/max values are not known. *Note that this is different 
than
-///           the SQL filter semantics where `NULL` means the row is filtered
-///           out.*
-///

Review Comment:
   Can you also please update the examples in the table below to include the 
new predicates?
   
   ```rust
   /// Here are some examples of the rewritten predicates:
   ...
   ```



##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -2785,7 +2893,7 @@ mod tests {
             vec![lit(1), lit(2), lit(3)],
             false,
         ));
-        let expected_expr = "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 
1 AND 1 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 2 AND 
2 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 3 AND 3 <= 
c1_max@1";
+        let expected_expr = "(c1_null_count@2 IS NULL OR c1_row_count@3 IS 
NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 
1) AND (c1_max@1 IS NULL OR 1 <= c1_max@1) OR (c1_null_count@2 IS NULL OR 
c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS 
NULL OR c1_min@0 <= 2) AND (c1_max@1 IS NULL OR 2 <= c1_max@1) OR 
(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != 
c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 3) AND (c1_max@1 IS NULL 
OR 3 <= c1_max@1)";

Review Comment:
   insta is cool for sure. Maybe you can write up a ticket describing the need. 



##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -722,8 +709,15 @@ impl BoolVecBuilder {
                 // False means all containers can not pass the predicate
                 self.inner = vec![false; self.inner.len()];
             }
+            ColumnarValue::Scalar(ScalarValue::Boolean(None)) => {
+                // this should catch bugs in tests while keeping
+                // the "old" behavior for PruningPredicate in production
+                // that if any nulls are encountered they are treated as truthy
+                #[cfg(debug_assertions)]
+                panic!("no predicate should return null!")
+            }

Review Comment:
   Maybe we can also add a `warn!` log warning here as well on release builds



##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -2475,7 +2587,7 @@ mod tests {
     fn row_group_predicate_eq() -> Result<()> {
         let schema = Schema::new(vec![Field::new("c1", DataType::Int32, 
false)]);
         let expected_expr =
-            "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 1 AND 1 <= 
c1_max@1";
+            "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR 
c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 1) AND 
(c1_max@1 IS NULL OR 1 <= c1_max@1)";

Review Comment:
   I am worried about turning a simple single column `col <op> const` predicate 
like `c1_null_count@2 != c1_row_count` into a more complex one like 
`c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != 
c1_row_count@3` will make it hard for these predicates to be pushed down in 
some cases 
   
   At the moment the expressions look like this:
   ```
   expr AND expr AND ...
   ```
   
   And each `expr` is typically a simple single column `col <op> constant`
   
   After this change the expressions would be like
   
   ```
   (expr OR expr OR ...) AND (expr OR expr ...)
   ```
   
   I thought about  using `IS DISTINCT FROM` instead but I don't think that 
yields the correct value in all cases
   
   
https://github.com/apache/datafusion/blob/7f9a8c024c2993732d600bf32d3ee64ef7afa96d/datafusion/expr-common/src/operator.rs#L49-L52
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to