[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3454: Add more cast pruning tests, handle unsupported predicates better

GitBox Mon, 12 Sep 2022 12:00:14 -0700


alamb commented on code in PR #3454:
URL: https://github.com/apache/arrow-datafusion/pull/3454#discussion_r968811074



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -390,8 +406,13 @@ fn build_statistics_record_batch<S: PruningStatistics>(
     }
 
     let schema = Arc::new(Schema::new(fields));
-    RecordBatch::try_new(schema, arrays)
-        .map_err(|err| DataFusionError::Plan(err.to_string()))
+    // provide the count in case there were no needed statistics

Review Comment:
   Previously, if the pruning predicate did not call for any columns (b/c it 
was "true" for example) this would error because in earlier versions of 
`arrow-rs` `RecordBatch`es could not have `0` columns -- now that they can, we 
use that feature to build a 0 column RecordBatch when the predicate has no 
columns that can be transformed 



##########
datafusion/core/tests/parquet_pruning.rs:
##########
@@ -237,7 +237,7 @@ async fn prune_int32_scalar_fun() {
     test_prune(
         Scenario::Int32,
         "SELECT * FROM t where abs(i) = 1",
-        Some(4),
+        Some(0),

Review Comment:
   The changes in this file are due to the fact that these predicates no longer 
error (instead they evaluate successfully to all `true`,  meaning all record 
groups are kept)
   
   The `3` is the number of rows that pass this predicate -- so no actual 
answers are changed here



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -168,38 +170,51 @@ impl PruningPredicate {
     /// simplified version `b`. The predicates are simplified via the
     /// ConstantFolding optimizer pass
     pub fn prune<S: PruningStatistics>(&self, statistics: &S) -> 
Result<Vec<bool>> {
-        // build statistics record batch
-        let predicate_array =
-            build_statistics_record_batch(statistics, &self.required_columns)
-                .and_then(|statistics_batch| {
-                    // execute predicate expression
-                    self.predicate_expr.evaluate(&statistics_batch)
-                })
-                .and_then(|v| match v {
-                    ColumnarValue::Array(array) => Ok(array),
-                    ColumnarValue::Scalar(_) => Err(DataFusionError::Internal(
-                        "predicate expression didn't return an 
array".to_string(),
-                    )),
-                })?;
-
-        let predicate_array = predicate_array
-            .as_any()
-            .downcast_ref::<BooleanArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(format!(
-                    "Expected pruning predicate evaluation to be BooleanArray, 
\
-                     but was {:?}",
-                    predicate_array
-                ))
-            })?;
-
-        // when the result of the predicate expression for a row group is null 
/ undefined,
-        // e.g. due to missing statistics, this row group can't be filtered 
out,
-        // so replace with true
-        Ok(predicate_array
-            .into_iter()
-            .map(|x| x.unwrap_or(true))
-            .collect::<Vec<_>>())
+        // build a RecordBatch that contains the min/max values in the
+        // appropriate statistics columns
+        let statistics_batch =
+            build_statistics_record_batch(statistics, &self.required_columns)?;
+
+        // Evaluate the pruning predicate on that record batch.
+        //
+        // Use true when the result of evaluating a predicate
+        // expression on a row group is null (aka `None`). Null can
+        // arise when the statistics are unknown or some calculation
+        // in the predicate means we don't know for sure if the row
+        // group can be filtered out or not. To maintain correctness
+        // the row group must be kept and thus `true` is returned.
+        match self.predicate_expr.evaluate(&statistics_batch)? {
+            ColumnarValue::Array(array) => {
+                let predicate_array = array
+                    .as_any()
+                    .downcast_ref::<BooleanArray>()
+                    .ok_or_else(|| {
+                        DataFusionError::Internal(format!(
+                            "Expected pruning predicate evaluation to be 
BooleanArray, \
+                             but was {:?}",
+                            array
+                        ))
+                    })?;
+
+                Ok(predicate_array
+                   .into_iter()
+                   .map(|x| x.unwrap_or(true)) // None -> true per comments 
above
+                   .collect::<Vec<_>>())
+
+            },
+            // result was a column
+            ColumnarValue::Scalar(ScalarValue::Boolean(v)) => {
+                let v = v.unwrap_or(true); // None -> true per comments above

Review Comment:
   This is new code -- to handle boolean exprs (aka `true`)



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -168,38 +170,51 @@ impl PruningPredicate {
     /// simplified version `b`. The predicates are simplified via the
     /// ConstantFolding optimizer pass
     pub fn prune<S: PruningStatistics>(&self, statistics: &S) -> 
Result<Vec<bool>> {
-        // build statistics record batch
-        let predicate_array =
-            build_statistics_record_batch(statistics, &self.required_columns)
-                .and_then(|statistics_batch| {
-                    // execute predicate expression
-                    self.predicate_expr.evaluate(&statistics_batch)
-                })
-                .and_then(|v| match v {
-                    ColumnarValue::Array(array) => Ok(array),
-                    ColumnarValue::Scalar(_) => Err(DataFusionError::Internal(
-                        "predicate expression didn't return an 
array".to_string(),
-                    )),
-                })?;
-
-        let predicate_array = predicate_array
-            .as_any()
-            .downcast_ref::<BooleanArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(format!(
-                    "Expected pruning predicate evaluation to be BooleanArray, 
\
-                     but was {:?}",
-                    predicate_array
-                ))
-            })?;
-
-        // when the result of the predicate expression for a row group is null 
/ undefined,
-        // e.g. due to missing statistics, this row group can't be filtered 
out,
-        // so replace with true
-        Ok(predicate_array
-            .into_iter()
-            .map(|x| x.unwrap_or(true))
-            .collect::<Vec<_>>())
+        // build a RecordBatch that contains the min/max values in the
+        // appropriate statistics columns
+        let statistics_batch =
+            build_statistics_record_batch(statistics, &self.required_columns)?;
+
+        // Evaluate the pruning predicate on that record batch.
+        //
+        // Use true when the result of evaluating a predicate
+        // expression on a row group is null (aka `None`). Null can
+        // arise when the statistics are unknown or some calculation
+        // in the predicate means we don't know for sure if the row
+        // group can be filtered out or not. To maintain correctness
+        // the row group must be kept and thus `true` is returned.
+        match self.predicate_expr.evaluate(&statistics_batch)? {
+            ColumnarValue::Array(array) => {
+                let predicate_array = array
+                    .as_any()
+                    .downcast_ref::<BooleanArray>()
+                    .ok_or_else(|| {
+                        DataFusionError::Internal(format!(
+                            "Expected pruning predicate evaluation to be 
BooleanArray, \
+                             but was {:?}",
+                            array
+                        ))
+                    })?;
+
+                Ok(predicate_array
+                   .into_iter()
+                   .map(|x| x.unwrap_or(true)) // None -> true per comments 
above
+                   .collect::<Vec<_>>())
+
+            },
+            // result was a column
+            ColumnarValue::Scalar(ScalarValue::Boolean(v)) => {
+                let v = v.unwrap_or(true); // None -> true per comments above

Review Comment:
   Without this code and the code to make record batches with 0 columns below, 
some of the new negative pruning tests (the ones with invalid casts) error 
rather than return `true`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3454: Add more cast pruning tests, handle unsupported predicates better

Reply via email to