[GitHub] [arrow-datafusion] alamb commented on a change in pull request #965: Move CBOs and Statistics to physical plan

GitBox Sat, 11 Sep 2021 05:12:47 -0700


alamb commented on a change in pull request #965:
URL: https://github.com/apache/arrow-datafusion/pull/965#discussion_r706603438




##########
File path: datafusion/src/physical_plan/explain.rs
##########
@@ -156,6 +158,11 @@ impl ExecutionPlan for ExplainExec {
             }
         }
     }
+
+    fn statistics(&self) -> Statistics {
+        // TODO stats: validate that we don't need to provide statistics on an 
EXPLAIN plan

Review comment:
       not needed

##########
File path: datafusion/src/physical_plan/mod.rs
##########
@@ -89,6 +89,34 @@ impl Stream for EmptyRecordBatchStream {
 /// Physical planner interface
 pub use self::planner::PhysicalPlanner;
 
+/// Statistics for an physical plan node
+/// Fields are optional and can be inexact because the sources
+/// sometimes provide approximate estimates for performance reasons
+/// and the transformations output are not always predictable.
+#[derive(Debug, Clone, Default, PartialEq)]
+pub struct Statistics {
+    /// The number of table rows
+    pub num_rows: Option<usize>,
+    /// total byte of the table rows
+    pub total_byte_size: Option<usize>,
+    /// Statistics on a column level
+    pub column_statistics: Option<Vec<ColumnStatistics>>,
+    /// Some datasources or transformations might provide inexact estimates

Review comment:
       While writing this, I did wonder (as a follow on PR) we might want to 
make the `is_exact` on a field by field basis (for example, now it is not 
possible for a `Statistics` to have an exact `row_count` but an estimated 
`null_count`)

##########
File path: datafusion/src/physical_plan/empty.rs
##########
@@ -54,6 +54,23 @@ impl EmptyExec {
     pub fn produce_one_row(&self) -> bool {
         self.produce_one_row
     }
+
+    fn data(&self) -> Result<Vec<RecordBatch>> {

Review comment:
       👍 

##########
File path: datafusion/src/physical_plan/cross_join.rs
##########
@@ -207,6 +208,79 @@ impl ExecutionPlan for CrossJoinExec {
             }
         }
     }
+
+    fn statistics(&self) -> Statistics {
+        stats_cartesian_product(
+            self.left.statistics(),
+            self.left.schema().fields().len(),
+            self.right.statistics(),
+            self.right.schema().fields().len(),
+        )
+    }
+}
+
+/// [left/right]_col_count are required in case the column statistics are None
+fn stats_cartesian_product(

Review comment:
       Is the code to calculate cross join cardinality new? If so, perhaps that 
could be pulled into its own PR

##########
File path: datafusion/src/physical_plan/union.rs
##########
@@ -106,16 +109,68 @@ impl ExecutionPlan for UnionExec {
             }
         }
     }
+
+    fn statistics(&self) -> Statistics {
+        self.inputs
+            .iter()
+            .map(|ep| ep.statistics())
+            .reduce(stats_union)

Review comment:
       this is a pretty clever formulation. 👍 

##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -462,6 +462,13 @@ impl ExecutionPlan for HashJoinExec {
     fn metrics(&self) -> Option<MetricsSet> {
         Some(self.metrics.clone_inner())
     }
+
+    fn statistics(&self) -> Statistics {
+        // TODO stats: it is not possible in general to know the output size 
of joins

Review comment:
       I agree -- the other thing that is possible is to estimate the 
selectivity of the predicates using histograms -- there are several academic 
papers on this, though in practice I found such estimates close only for the 
first one  or two joins performed in the plan 

##########
File path: datafusion/src/physical_plan/filter.rs
##########
@@ -144,6 +144,11 @@ impl ExecutionPlan for FilterExec {
     fn metrics(&self) -> Option<MetricsSet> {
         Some(self.metrics.clone_inner())
     }
+
+    /// The output statistics of a filtering operation are unknown

Review comment:
       Eventually it would be cool to estimate selectivity (and hence output 
statistics) based on predicates and histograms but that is a ways off 

##########
File path: datafusion/src/physical_plan/json.rs
##########
@@ -324,6 +326,11 @@ impl ExecutionPlan for NdJsonExec {
             }
         }
     }
+
+    fn statistics(&self) -> Statistics {
+        // TODO stats: handle statistics

Review comment:
       For other reviewers, this is consistent with the current master behavior 
(the Json table providers don't appear to offer statistics either)

##########
File path: datafusion/src/physical_plan/limit.rs
##########
@@ -213,6 +234,30 @@ impl ExecutionPlan for LocalLimitExec {
             }
         }
     }
+
+    fn statistics(&self) -> Statistics {

Review comment:
       this looks like a duplicate of `GlobalLimitExec::statistics` -- could it 
be refactored to avoid the duplication?

##########
File path: datafusion/src/physical_plan/limit.rs
##########
@@ -135,6 +135,27 @@ impl ExecutionPlan for GlobalLimitExec {
             }
         }
     }
+
+    fn statistics(&self) -> Statistics {
+        let input_stats = self.input.statistics();
+        match input_stats {
+            // if the input does not reach the limit globally, return input 
stats
+            Statistics {
+                num_rows: Some(nr), ..
+            } if nr <= self.limit => input_stats,
+            // if the input is greater than the limit, the num_row will be the 
limit
+            // but we won't be able to predict the other statistics
+            Statistics {
+                num_rows: Some(nr), ..
+            } if nr > self.limit => Statistics {
+                num_rows: Some(self.limit),
+                is_exact: input_stats.is_exact,

Review comment:
       It is a shame we have to drop all the information (and can't specify the 
exact number of rows, but estimate the other stats). We can improve this in a 
follow on PR if it turns out to be important

##########
File path: datafusion/tests/provider_filter_pushdown.rs
##########
@@ -98,6 +96,12 @@ impl ExecutionPlan for CustomPlan {
             }
         }
     }
+
+    fn statistics(&self) -> Statistics {
+        // here we could provide more accurate statistics
+        // but we want to test the filter pushdown not the CBOs
+        Statistics::default()

Review comment:
       👍 

##########
File path: datafusion/src/physical_plan/mod.rs
##########
@@ -89,6 +89,34 @@ impl Stream for EmptyRecordBatchStream {
 /// Physical planner interface
 pub use self::planner::PhysicalPlanner;
 
+/// Statistics for an physical plan node
+/// Fields are optional and can be inexact because the sources
+/// sometimes provide approximate estimates for performance reasons
+/// and the transformations output are not always predictable.
+#[derive(Debug, Clone, Default, PartialEq)]
+pub struct Statistics {
+    /// The number of table rows
+    pub num_rows: Option<usize>,
+    /// total byte of the table rows
+    pub total_byte_size: Option<usize>,
+    /// Statistics on a column level
+    pub column_statistics: Option<Vec<ColumnStatistics>>,
+    /// Some datasources or transformations might provide inexact estimates

Review comment:
       ```suggestion
       /// If true, any field that is `Some(..)` is the actual value in the 
data provided by the operator (it is not
       /// an estimate). Any or all other fields might still be None, in which 
case no information is known.  
       /// if false, any field that is `Some(..)` may contain an inexact 
estimate and may not be the actual value.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #965: Move CBOs and Statistics to physical plan

Reply via email to