[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #965: Move CBOs and Statistics to physical plan

GitBox Sun, 12 Sep 2021 00:01:08 -0700


yjshen commented on a change in pull request #965:
URL: https://github.com/apache/arrow-datafusion/pull/965#discussion_r706770798




##########
File path: datafusion/src/physical_optimizer/utils.rs
##########
@@ -0,0 +1,47 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Collection of utility functions that are leveraged by the query optimizer 
rules
+
+use super::optimizer::PhysicalOptimizerRule;
+use crate::execution::context::ExecutionConfig;
+
+use crate::error::Result;
+use crate::physical_plan::ExecutionPlan;
+use std::sync::Arc;
+
+/// Convenience rule for writing optimizers: recursively invoke
+/// optimize on plan's children and then return a node of the same
+/// type. Useful for optimizer rules which want to leave the type
+/// of plan unchanged but still apply to the children.
+pub fn optimize_children(

Review comment:
       👍

##########
File path: datafusion/src/physical_plan/mod.rs
##########
@@ -89,6 +89,34 @@ impl Stream for EmptyRecordBatchStream {
 /// Physical planner interface
 pub use self::planner::PhysicalPlanner;
 
+/// Statistics for an physical plan node
+/// Fields are optional and can be inexact because the sources
+/// sometimes provide approximate estimates for performance reasons
+/// and the transformations output are not always predictable.
+#[derive(Debug, Clone, Default, PartialEq)]
+pub struct Statistics {
+    /// The number of table rows
+    pub num_rows: Option<usize>,
+    /// total byte of the table rows
+    pub total_byte_size: Option<usize>,
+    /// Statistics on a column level
+    pub column_statistics: Option<Vec<ColumnStatistics>>,
+    /// Some datasources or transformations might provide inexact estimates
+    pub is_exact: bool,
+}
+/// This table statistics are estimates about column
+#[derive(Clone, Debug, Default, PartialEq)]
+pub struct ColumnStatistics {

Review comment:
       It might be beneficial to **compute** column statistics through a 
separate command, **store** it in a catalog or meta store, and **load** column 
stats from the storage lazily when we want a more accurate CBO result.  Such as 
histogram @alamb have mentioned earlier. 
   
   Otherwise, I think the current approach in this PR that only considers 
`Statistics` is enough to get low-hanging fruits for CBO.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #965: Move CBOs and Statistics to physical plan

Reply via email to