alamb commented on code in PR #12471: URL: https://github.com/apache/datafusion/pull/12471#discussion_r1769532838
########## datafusion/core/src/datasource/physical_plan/parquet/mod.rs: ########## @@ -741,7 +741,17 @@ impl ExecutionPlan for ParquetExec { } fn statistics(&self) -> Result<Statistics> { - Ok(self.projected_statistics.clone()) + // When filters are pushed down, we have no way of knowing the exact statistics. + // Note that pruning predicate is also a kind of filter pushdown. Review Comment: ```suggestion // Note that pruning predicate is also a kind of filter pushdown. // (bloom filters use `pruning_predicate` too) ``` ########## datafusion/core/src/datasource/physical_plan/parquet/mod.rs: ########## @@ -739,7 +739,17 @@ impl ExecutionPlan for ParquetExec { } fn statistics(&self) -> Result<Statistics> { - Ok(self.projected_statistics.clone()) + // When filters are pushed down, we have no way of knowing the exact statistics. + // Note that pruning predicate is also a kind of filter pushdown. + let stats = if self.pruning_predicate.is_some() + || self.page_pruning_predicate.is_some() Review Comment: Makes sense, maybe we can update the comments to help future readers. I left a suggestion ########## datafusion/core/src/datasource/listing/table.rs: ########## @@ -817,6 +817,20 @@ impl TableProvider for ListingTable { .map(|col| Ok(self.table_schema.field_with_name(&col.0)?.clone())) .collect::<Result<Vec<_>>>()?; + // If the filters can be resolved using only partition cols, there is no need to + // pushdown it to TableScan, otherwise, `unhandled` pruning predicates will be generated + let table_partition_col_names = table_partition_cols Review Comment: I guess I was thinking of "if we removed this code by accident or in some refactoring, would any test change"? If the answer is no, that means we are relying on reviewers to avoid such a regression, which given our available review capacity (never as much as we would like!) is likely not very robust I actually checked by removing this code and running the tests and indeed it does fail as you say ``` failures: ---- sql::path_partition::parquet_statistics stdout ---- thread 'sql::path_partition::parquet_statistics' panicked at datafusion/core/tests/sql/path_partition.rs:492:5: assertion `left == right` failed left: Inexact(1) right: Exact(1) note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace failures: sql::path_partition::parquet_statistics test result: FAILED. 245 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.57s error: test failed, to rerun pass `-p datafusion --test core_integration` ``` I personally think it would be clearer to have a separate test for this pushdown behavior rather than relying on the statistics test, but the code is covered ########## datafusion/core/tests/parquet/file_statistics.rs: ########## @@ -36,8 +37,33 @@ use datafusion_execution::config::SessionConfig; use datafusion_execution::runtime_env::RuntimeEnvBuilder; use datafusion::execution::session_state::SessionStateBuilder; +use datafusion_expr::{BinaryExpr, Expr}; use tempfile::tempdir; +#[tokio::test] +async fn check_stats_precision_with_filter_pushdown() { + let testdata = datafusion::test_util::parquet_test_data(); + let filename = format!("{}/{}", testdata, "alltypes_plain.parquet"); + let table_path = ListingTableUrl::parse(filename).unwrap(); + + let opt = ListingOptions::new(Arc::new(ParquetFormat::default())); + let table = get_listing_table(&table_path, None, &opt).await; + let (_, _, state) = get_cache_runtime_state(); + // Scan without filter, stats are exact + let exec = table.scan(&state, None, &[], None).await.unwrap(); + assert_eq!(exec.statistics().unwrap().num_rows, Precision::Exact(8)); + + // Scan with filter pushdown, stats are inexact Review Comment: 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org