alamb commented on a change in pull request #965:
URL: https://github.com/apache/arrow-datafusion/pull/965#discussion_r706599558
##########
File path: ballista/rust/core/src/execution_plans/distributed_query.rs
##########
@@ -203,6 +203,12 @@ impl ExecutionPlan for DistributedQueryExec {
}
}
}
+
+ fn statistics(&self) -> Statistics {
+ // We cannot infer the statistics until the logical plan
Review comment:
this comment perhaps seems out of date
##########
File path: datafusion/src/datasource/mod.rs
##########
@@ -231,7 +232,10 @@ pub fn get_statistics_with_limit(
break;
}
}
- all_files.truncate(num_files);
+ if num_files < all_files.len() {
Review comment:
👍
##########
File path: datafusion/src/execution/context.rs
##########
@@ -710,17 +710,17 @@ impl Default for ExecutionConfig {
optimizers: vec![
Arc::new(ConstantFolding::new()),
Arc::new(EliminateLimit::new()),
- Arc::new(AggregateStatistics::new()),
Arc::new(ProjectionPushDown::new()),
Arc::new(FilterPushDown::new()),
Arc::new(SimplifyExpressions::new()),
- Arc::new(HashBuildProbeOrder::new()),
Arc::new(LimitPushDown::new()),
],
physical_optimizers: vec![
Arc::new(CoalesceBatches::new()),
Arc::new(Repartition::new()),
Arc::new(AddCoalescePartitionsExec::new()),
+ Arc::new(AggregateStatistics::new()),
Review comment:
I think we probably want to run the `AggregateStatistics` passes first
(prior to running the other optimizers); Even though it probably doesn't matter
at this time, if any of those optimizers were to use the statistics now or in
the future it might lead to tricky bugs.
##########
File path: datafusion/src/physical_optimizer/hash_build_probe_order.rs
##########
@@ -0,0 +1,291 @@
+// Licensed to the Apache Software Foundation (ASF) under one
Review comment:
this file also has substantial changes from
eef5e2d10e60652aa498925fae3f8c688f5b99fe
`src/optimizer/hash_build_probe_order.rs `
##########
File path: datafusion/src/execution/context.rs
##########
@@ -710,17 +710,17 @@ impl Default for ExecutionConfig {
optimizers: vec![
Arc::new(ConstantFolding::new()),
Arc::new(EliminateLimit::new()),
- Arc::new(AggregateStatistics::new()),
Arc::new(ProjectionPushDown::new()),
Arc::new(FilterPushDown::new()),
Arc::new(SimplifyExpressions::new()),
- Arc::new(HashBuildProbeOrder::new()),
Arc::new(LimitPushDown::new()),
],
physical_optimizers: vec![
Arc::new(CoalesceBatches::new()),
Arc::new(Repartition::new()),
Arc::new(AddCoalescePartitionsExec::new()),
+ Arc::new(AggregateStatistics::new()),
+ Arc::new(HashBuildProbeOrder::new()),
Review comment:
likewise it seems like HashBuildProbeOrder might want to go early in the
optimizer list -- the repartition / coalesce stuff is mostly mechanical where
as the join order changes might change where those mechanical changes go
##########
File path: datafusion/src/physical_plan/common.rs
##########
@@ -169,6 +169,48 @@ pub(crate) fn spawn_execution(
})
}
+/// Computes the statistics for on in-memory RecordBatch
Review comment:
```suggestion
/// Computes the statistics for an in-memory RecordBatch
```
##########
File path: ballista/rust/core/src/execution_plans/unresolved_shuffle.rs
##########
@@ -117,4 +119,12 @@ impl ExecutionPlan for UnresolvedShuffleExec {
}
}
}
+
+ fn statistics(&self) -> Statistics {
+ // We could try to fetch the statistics here from the shuffle writer,
Review comment:
I think this comment is now out of date (as in this PR is actually doing
what is suggested in the PR)
##########
File path: datafusion/src/physical_plan/mod.rs
##########
@@ -89,6 +89,34 @@ impl Stream for EmptyRecordBatchStream {
/// Physical planner interface
pub use self::planner::PhysicalPlanner;
+/// Statistics for an physical plan node
Review comment:
At some point it probably makes sense to put these into their own module
(aka `pub mod stats` or something). Doesn't have to be for this PR
##########
File path: datafusion/src/datasource/parquet.rs
##########
@@ -439,10 +426,6 @@ mod tests {
.fold(0, |acc, _| async move { acc + 1i32 })
.await;
- // test metadata
- assert_eq!(table.statistics().num_rows, Some(8));
Review comment:
These tests seem not to have been ported over the
physical_plan/parquet.rs (or I may have missed them). It would be good to keep
them
##########
File path: datafusion/src/datasource/datasource.rs
##########
@@ -104,15 +80,6 @@ pub trait TableProvider: Sync + Send {
limit: Option<usize>,
) -> Result<Arc<dyn ExecutionPlan>>;
- /// Returns the table Statistics
- /// Statistics should be optional because not all data sources can provide
statistics.
- fn statistics(&self) -> Statistics;
-
- /// Returns whether statistics provided are exact values or estimates
- fn has_exact_statistics(&self) -> bool {
Review comment:
this is also technically an API change (that `has_exact_statistics` is
now encoded in the Statistics itself) -- I think it is a good change.
##########
File path: datafusion/src/physical_plan/analyze.rs
##########
@@ -206,4 +208,9 @@ impl ExecutionPlan for AnalyzeExec {
}
}
}
+
+ fn statistics(&self) -> Statistics {
+ // TODO stats: validate that we don't need to provide statistics on an
EXPLAIN plan
Review comment:
I do not think stats on an explain plan add much/any value
##########
File path: datafusion/src/physical_optimizer/aggregate_statistics.rs
##########
@@ -0,0 +1,384 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Utilizing exact statistics from sources to avoid scanning data
+use std::sync::Arc;
+
+use arrow::datatypes::Schema;
Review comment:
This file seems to have substantial changes from
eef5e2d10e60652aa498925fae3f8c688f5b99fe src/optimizer/aggregate_statistics.rs
-- it is not just a rename. I am mostly noting this for other reviewers and my
own notes to review it more carefully
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]