alamb commented on code in PR #9223:
URL: https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1525201435


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -903,6 +1004,42 @@ impl<'a> PruningExpressionBuilder<'a> {
         self.required_columns
             .max_column_expr(&self.column, &self.column_expr, self.field)
     }
+
+    /// Note that this function intentionally overwrites the column expression to [`phys_expr::Column`].
+    /// i.e. expressions like [`phys_expr::CastExpr`] or [`phys_expr::TryCastExpr`] will be overwritten.
+    ///
+    /// This is to avoid cases like `cast(x_null_count)` or `try_cast(x_null_count)`.

Review Comment:
   I don't understand this comment as I don't see any overwriting happening -- instead it is adding a new entry to the required columns and returning the relevant expression, right?
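   
   For context, a self-contained sketch of the pattern described above (simplified, hypothetical types -- not the actual `RequiredColumns` in pruning.rs): requesting a statistic appends a new entry to the list of required statistics columns and returns a reference to that new column, without modifying anything recorded earlier.
   
   ```rust
   /// Which statistic is being requested for a source column.
   #[derive(Debug, Clone, Copy, PartialEq)]
   enum StatisticsType {
       NullCount,
       RowCount,
   }
   
   /// Simplified stand-in for the builder's list of required statistics columns.
   #[derive(Default)]
   struct RequiredStatColumns {
       /// (source column name, which statistic, name of the statistics column)
       columns: Vec<(String, StatisticsType, String)>,
   }
   
   impl RequiredStatColumns {
       /// Register a statistics column and return its index and name in the
       /// statistics batch the pruning predicate is evaluated against.
       fn stat_column(&mut self, column: &str, stat: StatisticsType) -> (usize, String) {
           let suffix = match stat {
               StatisticsType::NullCount => "null_count",
               StatisticsType::RowCount => "row_count",
           };
           let stat_name = format!("{column}_{suffix}");
           let index = self.columns.len();
           self.columns.push((column.to_string(), stat, stat_name.clone()));
           (index, stat_name)
       }
   }
   
   fn main() {
       let mut required = RequiredStatColumns::default();
       let (idx, name) = required.stat_column("x", StatisticsType::NullCount);
       assert_eq!((idx, name.as_str()), (0, "x_null_count"));
       // A second request adds another entry rather than overwriting the first.
       let (idx, name) = required.stat_column("x", StatisticsType::RowCount);
       assert_eq!((idx, name.as_str()), (1, "x_row_count"));
   }
   ```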



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -318,6 +327,19 @@ pub trait PruningStatistics {
 /// `x = 5 AND y = 10` | `x_min <= 5 AND 5 <= x_max AND y_min <= 10 AND 10 <= y_max`
 /// `x IS NULL`  | `x_null_count > 0`
 ///
+/// In addition, for a given column `x`, the `x_null_count` and `x_row_count` will
+/// be wrapped around the above rewritten predicate to form the final rewritten predicate.
+/// This step is necessary to handle the case where the column `x` is kown to be all nulls,
+/// This is different from knowing nothing about the column `x`, which confusionly is
+/// enconded by returning `NULL` for the min/max values from [`PruningStatistics::min_values`].

Review Comment:
   There are some misspellings in this paragraph and I think we can make it more concise. Here is a suggestion:
   
   ```suggestion
   /// In addition, for a given column `x`, the `x_null_count` and `x_row_count` will
   /// be compared using a `CASE` statement to wrap the rewritten predicate to handle
   /// the case where the column `x` is known to be all `NULL`s. Note this
   /// is different from knowing nothing about the column `x`, which confusingly is
   /// encoded by returning `NULL` for the min/max values from [`PruningStatistics::min_values`].
   ```
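   
   To make the distinction concrete, here is a minimal standalone sketch (plain Rust with `Option<bool>` standing in for SQL's three-valued logic -- not the actual DataFusion evaluator) of the decision the wrapped predicate encodes: a container whose null count equals its row count can be pruned, while a container about which nothing is known must be kept.
   
   ```rust
   /// Evaluate the rewritten predicate for one container. `min_max_check` is the
   /// result of the original rewritten expression (e.g. `x_min <= 5 AND 5 <= x_max`),
   /// where `None` means the min/max statistics were NULL (unknown).
   fn keep_container(
       null_count: Option<u64>,
       row_count: Option<u64>,
       min_max_check: Option<bool>,
   ) -> bool {
       let result = match (null_count, row_count) {
           // CASE WHEN x_null_count = x_row_count THEN false: column is known all-null
           (Some(n), Some(r)) if n == r => Some(false),
           // ELSE fall back to the min/max comparison
           _ => min_max_check,
       };
       // Only a provably false predicate prunes the container; NULL (unknown) keeps it.
       result.unwrap_or(true)
   }
   
   fn main() {
       // Known to be entirely NULL (null_count == row_count): prune.
       assert!(!keep_container(Some(100), Some(100), None));
       // Nothing known at all (NULL min/max and NULL counts): must keep.
       assert!(keep_container(None, None, None));
   }
   ```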



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -2819,14 +3171,77 @@ mod tests {
         let expected_ret = &[false, true, true, true, false];
 
         prune_with_expr(
-            // i IS NULL, with actual null statistcs
+            // i IS NULL, with actual null statistics
             col("i").is_null(),
             &schema,
             &statistics,
             expected_ret,
         );
     }
 
+    #[test]
+    fn prune_int32_column_is_known_all_null() {
+        let (schema, statistics) = int32_setup();
+
+        // Expression "i < 0"
+        // i [-5, 5] ==> some rows could pass (must keep)
+        // i [1, 11] ==> no rows can pass (not keep)
+        // i [-11, -1] ==>  all rows must pass (must keep)
+        // i [NULL, NULL]  ==> unknown (must keep)
+        // i [1, NULL]  ==> no rows can pass (not keep)
+        let expected_ret = &[true, false, true, true, false];
+
+        prune_with_expr(
+            // i < 0
+            col("i").lt(lit(0)),
+            &schema,
+            &statistics,
+            expected_ret,
+        );
+
+        // provide row counts for each column
+        let statistics = statistics.with_row_counts(
+            "i",
+            vec![
+                Some(10), // 10 rows of data
+                Some(9),  // 9 rows of data
+                None,     // unknown row counts
+                Some(4),
+                Some(10),
+            ],
+        );
+
+        // provide null counts for each column
+        let statistics = statistics.with_null_counts(
+            "i",
+            vec![
+                Some(0), // no nulls
+                Some(1), // 1 null
+                None,    // unknown nulls
+                Some(4), // 4 nulls, which is the same as the row counts, i.e. this column is all null (don't keep)

Review Comment:
   👍 



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -2819,14 +3171,77 @@ mod tests {
         let expected_ret = &[false, true, true, true, false];
 
         prune_with_expr(
-            // i IS NULL, with actual null statistcs
+            // i IS NULL, with actual null statistics
             col("i").is_null(),
             &schema,
             &statistics,
             expected_ret,
         );
     }
 
+    #[test]
+    fn prune_int32_column_is_known_all_null() {
+        let (schema, statistics) = int32_setup();
+
+        // Expression "i < 0"
+        // i [-5, 5] ==> some rows could pass (must keep)
+        // i [1, 11] ==> no rows can pass (not keep)
+        // i [-11, -1] ==>  all rows must pass (must keep)
+        // i [NULL, NULL]  ==> unknown (must keep)
+        // i [1, NULL]  ==> no rows can pass (not keep)
+        let expected_ret = &[true, false, true, true, false];
+
+        prune_with_expr(
+            // i < 0
+            col("i").lt(lit(0)),
+            &schema,
+            &statistics,
+            expected_ret,
+        );
+
+        // provide row counts for each column
+        let statistics = statistics.with_row_counts(
+            "i",
+            vec![
+                Some(10), // 10 rows of data
+                Some(9),  // 9 rows of data
+                None,     // unknown row counts
+                Some(4),
+                Some(10),
+            ],
+        );
+

Review Comment:
   Can you also please test whether just providing `row_counts` (but not null counts) changes the result?
   
   Something like
   
   ```suggestion
           // pruning result is still the same if we only know row counts
           prune_with_expr(
               // i < 0
               col("i").lt(lit(0)),
               &schema,
               &statistics,
               expected_ret,
           );
   ```



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -2819,14 +3171,77 @@ mod tests {
         let expected_ret = &[false, true, true, true, false];
 
         prune_with_expr(
-            // i IS NULL, with actual null statistcs
+            // i IS NULL, with actual null statistics
             col("i").is_null(),
             &schema,
             &statistics,
             expected_ret,
         );
     }
 
+    #[test]
+    fn prune_int32_column_is_known_all_null() {
+        let (schema, statistics) = int32_setup();
+
+        // Expression "i < 0"
+        // i [-5, 5] ==> some rows could pass (must keep)
+        // i [1, 11] ==> no rows can pass (not keep)
+        // i [-11, -1] ==>  all rows must pass (must keep)
+        // i [NULL, NULL]  ==> unknown (must keep)
+        // i [1, NULL]  ==> no rows can pass (not keep)
+        let expected_ret = &[true, false, true, true, false];
+
+        prune_with_expr(
+            // i < 0
+            col("i").lt(lit(0)),
+            &schema,
+            &statistics,
+            expected_ret,
+        );
+
+        // provide row counts for each column
+        let statistics = statistics.with_row_counts(
+            "i",
+            vec![
+                Some(10), // 10 rows of data
+                Some(9),  // 9 rows of data
+                None,     // unknown row counts
+                Some(4),
+                Some(10),
+            ],
+        );
+
+        // provide null counts for each column
+        let statistics = statistics.with_null_counts(
+            "i",
+            vec![
+                Some(0), // no nulls
+                Some(1), // 1 null
+                None,    // unknown nulls
+                Some(4), // 4 nulls, which is the same as the row counts, i.e. this column is all null (don't keep)
+                Some(0), // 0 nulls (max=null too which means no known max)
+            ],
+        );
+
+        // Expression "i < 0" with actual null and row counts statistics
+        // col | min, max     | row counts | null counts |
+        // ----+--------------+------------+-------------+
+        //  i  | [-5, 5]      | 10         | 0           | ==> Some rows could pass (must keep)
+        //  i  | [1, 11]      | 9          | 1           | ==> No rows can pass (not keep)
+        //  i  | [-11,-1]     | Unknown    | Unknown     | ==> All rows must pass (must keep)
+        //  i  | [NULL, NULL] | 4          | 4           | ==> The column is all null (not keep)
+        //  i  | [1, NULL]    | 10         | 0           | ==> No rows can pass (not keep)
+        let expected_ret = &[true, false, true, false, false];

Review Comment:
   The 4th element in this array is `false`, which is different from the 4th element when null counts aren't known. 👍
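   
   Working that 4th container through by hand (a standalone check, not the DataFusion evaluator): its min/max are NULL, so `i < 0` alone is unknown and the container previously had to be kept; once `i_null_count = 4 = i_row_count` is known, the `WHEN` branch fires and the `CASE` returns `false`, so the container is pruned.
   
   ```rust
   fn main() {
       // 4th container: min = NULL, max = NULL, row_count = 4, null_count = 4
       let (null_count, row_count): (Option<u64>, Option<u64>) = (Some(4), Some(4));
   
       let keep = match (null_count, row_count) {
           // WHEN i_null_count = i_row_count THEN false: column is known all-null
           (Some(n), Some(r)) if n == r => false,
           // ELSE: NULL min/max say nothing, so the container must be kept
           _ => true,
       };
       assert!(!keep); // pruned, matching the expected `false` above
   }
   ```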



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -326,28 +348,46 @@ pub trait PruningStatistics {
 /// LiteralGuarantees are not satisfied
 ///
 /// **Second Pass**: Evaluates the rewritten expression using the
-/// min/max/null_counts values for each column for each container. For any
+/// min/max/null_counts/row_counts values for each column for each container. For any
 /// container that this expression evaluates to `false`, it rules out those
 /// containers.
 ///
-/// For example, given the predicate, `x = 5 AND y = 10`, if we know `x` is
-/// between `1 and 100` and we know that `y` is between `4` and `7`, the input
-/// statistics might look like
+///
+/// ### Example 1
+/// Given the predicate, `x = 5 AND y = 10`, if we know that for a given container, `x` is
+/// between `1 and 100` and we know that `y` is between `4` and `7`, we know nothing about
+/// the null count and row count of `x` and `y`, the input statistics might look like:
 ///
 /// Column   | Value
 /// -------- | -----
 /// `x_min`  | `1`
 /// `x_max`  | `100`
+/// `x_null_count` | `null`
+/// `x_row_count`  | `null`
 /// `y_min`  | `4`
 /// `y_max`  | `7`
+/// `y_null_count` | `null`
+/// `y_row_count`  | `null`
 ///
 /// The rewritten predicate would look like
 ///
-/// `x_min <= 5 AND 5 <= x_max AND  y_min <= 10 AND 10 <= y_max`
+/// ```sql
+/// CASE
+///     WHEN x_null_count = x_row_count THEN false
+///     ELSE x_min <= 5 AND 5 <= x_max
+/// END
+/// AND
+/// CASE
+///     WHEN y_null_count = y_row_count THEN false
+///     ELSE y_min <= 10 AND 10 <= y_max
+/// END
+/// ```
 ///
-/// When these values are substituted in to the rewritten predicate and
+/// When these statistics values are substituted in to the rewritten predicate and
 /// simplified, the result is `false`:
 ///
+/// * `CASE WHEN null = null THEN false ELSE 1 <= 5 AND 5 <= 100 END AND CASE WHEN null = null THEN false ELSE 4 <= 10 AND 10 <= 7 END`
+/// * `null = null` is `null` which is false, so the `CASE` expression will use the `ELSE` clause

Review Comment:
   ```suggestion
   /// * `null = null` is `null` which is not true, so the `CASE` expression will use the `ELSE` clause
   ```
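   
   A tiny standalone illustration of that point (plain Rust, not the DataFusion expression evaluator): under SQL's three-valued logic `NULL = NULL` is `NULL`, and a `CASE` only takes a `WHEN` branch whose condition is exactly true, so unknown counts always fall through to the `ELSE` arm.
   
   ```rust
   /// SQL equality under three-valued logic: comparing NULL with anything is NULL.
   fn sql_eq(a: Option<u64>, b: Option<u64>) -> Option<bool> {
       match (a, b) {
           (Some(x), Some(y)) => Some(x == y),
           _ => None,
       }
   }
   
   /// CASE WHEN cond THEN t ELSE e END: the THEN branch is taken only when the
   /// condition is exactly true; both false and NULL fall through to ELSE.
   fn case_when(cond: Option<bool>, then: Option<bool>, otherwise: Option<bool>) -> Option<bool> {
       if cond == Some(true) {
           then
       } else {
           otherwise
       }
   }
   
   fn main() {
       // x_null_count and x_row_count are both NULL (unknown) for this container
       let cond = sql_eq(None, None);
       assert_eq!(cond, None); // NULL = NULL is NULL, which is not true...
       // ...so the CASE evaluates the ELSE arm (here `1 <= 5 AND 5 <= 100`, i.e. true)
       assert_eq!(case_when(cond, Some(false), Some(true)), Some(true));
   }
   ```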



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -903,6 +1004,42 @@ impl<'a> PruningExpressionBuilder<'a> {
         self.required_columns
             .max_column_expr(&self.column, &self.column_expr, self.field)
     }
+
+    /// Note that this function intentionally overwrites the column expression to [`phys_expr::Column`].
+    /// i.e. expressions like [`phys_expr::CastExpr`] or [`phys_expr::TryCastExpr`] will be overwritten.
+    ///
+    /// This is to avoid cases like `cast(x_null_count)` or `try_cast(x_null_count)`.
+    fn null_count_column_expr(&mut self) -> Result<Arc<dyn PhysicalExpr>> {
+        // overwrite to [`phys_expr::Column`]
+        let column_expr = Arc::new(self.column.clone()) as _;
+
+        // null_count is DataType::UInt64, which is different from the column's data type (i.e. self.field)
+        let null_count_field = &Field::new(self.field.name(), DataType::UInt64, true);
+
+        self.required_columns.null_count_column_expr(
+            &self.column,
+            &column_expr,
+            null_count_field,
+        )
+    }
+
+    /// Note that this function intentionally overwrites the column expression to [`phys_expr::Column`].

Review Comment:
   I have the same question about this comment
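   
   For reference, a minimal illustration of why the null-count path builds its own `UInt64` field instead of reusing the column expression's type (a sketch using `arrow_schema` directly; the field names here are just examples): whatever the data column's type is, its null-count statistic is a row count, so it is referenced through a bare `Column` with a fresh `UInt64` field rather than through a possibly `CAST`-wrapped expression.
   
   ```rust
   use arrow_schema::{DataType, Field};
   
   fn main() {
       // The data column can be any type (Utf8 as an example)...
       let data_field = Field::new("x", DataType::Utf8, true);
       // ...but its null-count statistic is always an unsigned count, so the
       // statistics column gets its own UInt64 field instead of inheriting
       // (or casting to) the data column's type.
       let null_count_field = Field::new("x", DataType::UInt64, true);
       assert_ne!(data_field.data_type(), null_count_field.data_type());
   }
   ```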



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -1320,14 +1457,56 @@ fn build_statistics_expr(
             );
         }
     };
+    let statistics_expr = wrap_case_expr(statistics_expr, expr_builder)?;
     Ok(statistics_expr)
 }
 
+/// Wrap the statistics expression in a case expression.
+/// This is necessary to handle the case where the column is known
+/// to be all nulls.
+///
+/// For example:
+///
+/// `x_min <= 10 AND 10 <= x_max`
+///
+/// will become
+///
+/// ```sql
+/// CASE
+///  WHEN x_null_count = x_row_count THEN false
+///  ELSE x_min <= 10 AND 10 <= x_max
+/// END
+/// ````
+///
+/// If the column is known to be all nulls, then the expression
+/// `x_null_count = x_row_count` will be true, which will cause the
+/// case expression to return false. Therefore, prune out the container.
+fn wrap_case_expr(

Review Comment:
   I wonder if this method might be easier to find if it were on the `PruningExpressionBuilder` itself?
   
   Like
   
   ```rust
   impl PruningExpressionBuilder { 
   ... 
   fn wrap_case_expr(&mut self, statistics_expr: Arc<dyn PhysicalExpr>) -> Result<Arc<dyn PhysicalExpr>> { 
   ...
   }
   ```
   



##########
datafusion/sqllogictest/test_files/repartition_scan.slt:
##########
@@ -138,7 +138,7 @@ physical_plan
 SortPreservingMergeExec: [column1@0 ASC NULLS LAST]
 --CoalesceBatchesExec: target_batch_size=8192
 ----FilterExec: column1@0 != 42
-------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
+------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=CASE WHEN column1_null_count@2 = column1_row_count@3 THEN false ELSE column1_min@0 != 42 OR 42 != column1_max@1 END, required_guarantees=[column1 not in (42)]

Review Comment:
   🤔 now that the pruning predicate is getting more complex, perhaps we should not display it by default anymore in explain plans. Maybe we can add a config option (as a follow-on PR) that is disabled by default 🤔 to control whether it is displayed



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -364,6 +404,50 @@ pub trait PruningStatistics {
 /// more analysis, for example by actually reading the data and evaluating the
 /// predicate row by row.
 ///
+/// ### Example 2
+/// Given the same predicate, `x = 5 AND y = 10`, if we know that for another given container,
+/// `x_min` is NULL and `x_max` is NULL, `x_null_count` is `100` and `x_row_count` is `100`;

Review Comment:
   ```suggestion
   /// `x_min` is NULL and `x_max` is NULL (the min/max values are unknown), `x_null_count` is `100` and `x_row_count` is `100`;
   ```



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -2066,7 +2301,21 @@ mod tests {
         let expr = col("c1")
             .lt(lit(1))
             .and(col("c2").eq(lit(2)).or(col("c2").eq(lit(3))));
-        let expected_expr = "c1_min@0 < 1 AND (c2_min@1 <= 2 AND 2 <= c2_max@2 OR c2_min@1 <= 3 AND 3 <= c2_max@2)";
+        let expected_expr = "\
+            CASE \
+                WHEN c1_null_count@1 = c1_row_count@2 THEN false \
+                ELSE c1_min@0 < 1 \
+            END \
+        AND (\
+                CASE \
+                    WHEN c2_null_count@5 = c2_row_count@6 THEN false \

Review Comment:
   It would be nice in the future to combine these clauses (so we don't have the repeated CASE expression) but for now I think this is good enough


