alamb commented on code in PR #12471: URL: https://github.com/apache/datafusion/pull/12471#discussion_r1761635636
########## datafusion/core/src/datasource/physical_plan/parquet/mod.rs: ########## @@ -739,7 +739,17 @@ impl ExecutionPlan for ParquetExec { } fn statistics(&self) -> Result<Statistics> { - Ok(self.projected_statistics.clone()) + // When filters are pushed down, we have no way of knowing the exact statistics. + // Note that pruning predicate is also a kind of filter pushdown. + let stats = if self.pruning_predicate.is_some() + || self.page_pruning_predicate.is_some() Review Comment: I think bloom filters should also belong in this list 🤔 ########## datafusion/physical-plan/src/execution_plan.rs: ########## @@ -379,6 +379,9 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync { /// Returns statistics for this `ExecutionPlan` node. If statistics are not /// available, should return [`Statistics::new_unknown`] (the default), not /// an error. + /// + /// For TableScan executors, which supports filter pushdown, special attention Review Comment: 💯 ########## datafusion/core/src/datasource/listing/table.rs: ########## @@ -817,6 +817,20 @@ impl TableProvider for ListingTable { .map(|col| Ok(self.table_schema.field_with_name(&col.0)?.clone())) .collect::<Result<Vec<_>>>()?; + // If the filters can be resolved using only partition cols, there is no need to + // pushdown it to TableScan, otherwise, `unhandled` pruning predicates will be generated + let table_partition_col_names = table_partition_cols Review Comment: This makes sense -- but I don't see a test for it. I also feel like this logic might be duplicated elsewhere(maybe the "can push filters code")? What would you think about updating the fields on `ListingTableProvider` somehow to reflect this knowledge a bit more centrally? ########## datafusion/core/src/datasource/physical_plan/parquet/mod.rs: ########## @@ -739,7 +739,17 @@ impl ExecutionPlan for ParquetExec { } fn statistics(&self) -> Result<Statistics> { - Ok(self.projected_statistics.clone()) + // When filters are pushed down, we have no way of knowing the exact statistics. + // Note that pruning predicate is also a kind of filter pushdown. + let stats = if self.pruning_predicate.is_some() + || self.page_pruning_predicate.is_some() + || (self.predicate.is_some() && self.pushdown_filters()) + { + Statistics::new_unknown(&self.schema()) Review Comment: Rather than discarding all statistics, what do you think about marking any precisions as `Precision::Inexact`? I am a bit concerned that we would simply discard the statistics entirely in the (very common) case of filter pushdown 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org