Dandandan commented on a change in pull request #380: URL: https://github.com/apache/arrow-datafusion/pull/380#discussion_r638136407
##########
File path: datafusion/src/physical_optimizer/pruning.rs
##########

```diff
@@ -28,50 +28,75 @@
 //! https://github.com/apache/arrow-datafusion/issues/363 it will
 //! be genericized.

-use std::{collections::HashSet, sync::Arc};
+use std::{collections::HashSet, convert::TryInto, sync::Arc};

 use arrow::{
-    array::{
-        make_array, new_null_array, ArrayData, ArrayRef, BooleanArray,
-        BooleanBufferBuilder,
-    },
-    buffer::MutableBuffer,
-    datatypes::{DataType, Field, Schema},
+    array::{ArrayRef, BooleanArray},
+    datatypes::{Field, Schema, SchemaRef},
     record_batch::RecordBatch,
 };

-use parquet::file::{
-    metadata::RowGroupMetaData, statistics::Statistics as ParquetStatistics,
-};
-
 use crate::{
     error::{DataFusionError, Result},
     execution::context::ExecutionContextState,
     logical_plan::{Expr, Operator},
     optimizer::utils,
     physical_plan::{planner::DefaultPhysicalPlanner, ColumnarValue, PhysicalExpr},
+    scalar::ScalarValue,
 };

+/// Interface to pass statistics information to [`PruningPredicates`]
+pub trait PruningStatistics {
+    /// return the minimum value for the named column, if known
+    fn min_value(&self, column: &str) -> Option<ScalarValue>;
+
+    /// return the maximum value for the named column, if known
+    fn max_value(&self, column: &str) -> Option<ScalarValue>;
+}
+
+/// Evaluates filter expressions on statistics in order to
+/// prune data containers (e.g. parquet row group)
+///
+/// See [`try_new`] for more information.
 #[derive(Debug, Clone)]
-/// Builder used for generating predicate functions that can be used
-/// to prune data based on statistics (e.g. parquet row group metadata)
-pub struct PruningPredicateBuilder {
-    schema: Schema,
+pub struct PruningPredicate {
+    /// The input schema against which the predicate will be evaluated
+    schema: SchemaRef,
+    /// Actual pruning predicate (rewritten in terms of column min/max statistics)
     predicate_expr: Arc<dyn PhysicalExpr>,
+    /// The statistics required to evaluate this predicate:
+    /// * The column name in the input schema
+    /// * Statstics type (e.g. Min or Max)
+    /// * The field the statistics value should be placed in for
+    ///   pruning predicate evaluation
     stat_column_req: Vec<(String, StatisticsType, Field)>,
 }

-impl PruningPredicateBuilder {
-    /// Try to create a new instance of [`PruningPredicateBuilder`]
+impl PruningPredicate {
+    /// Try to create a new instance of [`PruningPredicate`]
+    ///
+    /// This will translate the provided `expr` filter expression into
+    /// a *pruning predicate*.
     ///
-    /// This will translate the filter expression into a statistics predicate expression
+    /// A pruning predicate is one that has been rewritten in terms of
+    /// the min and max values of column references and that evaluates
+    /// to FALSE if the filter predicate would evaluate FALSE *for
+    /// every row* whose values fell within the min / max ranges (aka
+    /// could be pruned).
     ///
-    /// For example, `(column / 2) = 4` becomes `(column_min / 2) <= 4 && 4 <= (column_max / 2))`
-    pub fn try_new(expr: &Expr, schema: Schema) -> Result<Self> {
+    /// The pruning predicate evaluates to TRUE or NULL
+    /// if the filter predicate *might* evaluate to TRUE for at least
+    /// one row whose vaules fell within the min/max ranges (in other
```

Review comment:

```suggestion
    /// one row whose values fell within the min/max ranges (in other
```
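For context on the new trait in this diff: `PruningStatistics` only asks a data source for per-column min/max values, so any container with known bounds can take part in pruning, not just parquet row groups (the parquet-specific `RowGroupMetaData` / `ParquetStatistics` imports are removed above). A minimal sketch of an implementer, assuming the trait is reachable at `datafusion::physical_optimizer::pruning::PruningStatistics`; the struct and field names below are illustrative only:

```rust
use std::collections::HashMap;

use datafusion::physical_optimizer::pruning::PruningStatistics;
use datafusion::scalar::ScalarValue;

/// Hypothetical per-container statistics (e.g. for one parquet row group
/// or one file), holding the known min/max value for each column.
struct ContainerStats {
    mins: HashMap<String, ScalarValue>,
    maxes: HashMap<String, ScalarValue>,
}

impl PruningStatistics for ContainerStats {
    fn min_value(&self, column: &str) -> Option<ScalarValue> {
        // Returning None signals "unknown", so the container would be
        // kept rather than pruned on this column.
        self.mins.get(column).cloned()
    }

    fn max_value(&self, column: &str) -> Option<ScalarValue> {
        self.maxes.get(column).cloned()
    }
}
```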