alamb commented on code in PR #12471:
URL: https://github.com/apache/datafusion/pull/12471#discussion_r1761635636


##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -739,7 +739,17 @@ impl ExecutionPlan for ParquetExec {
     }
 
     fn statistics(&self) -> Result<Statistics> {
-        Ok(self.projected_statistics.clone())
+        // When filters are pushed down, we have no way of knowing the exact 
statistics.
+        // Note that pruning predicate is also a kind of filter pushdown.
+        let stats = if self.pruning_predicate.is_some()
+            || self.page_pruning_predicate.is_some()

Review Comment:
   I think bloom filters should also belong in this list 🤔 
   
   



##########
datafusion/physical-plan/src/execution_plan.rs:
##########
@@ -379,6 +379,9 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync {
     /// Returns statistics for this `ExecutionPlan` node. If statistics are not
     /// available, should return [`Statistics::new_unknown`] (the default), not
     /// an error.
+    ///
+    /// For TableScan executors, which supports filter pushdown, special 
attention

Review Comment:
   💯 



##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -817,6 +817,20 @@ impl TableProvider for ListingTable {
             .map(|col| Ok(self.table_schema.field_with_name(&col.0)?.clone()))
             .collect::<Result<Vec<_>>>()?;
 
+        // If the filters can be resolved using only partition cols, there is 
no need to
+        // pushdown it to TableScan, otherwise, `unhandled` pruning predicates 
will be generated
+        let table_partition_col_names = table_partition_cols

Review Comment:
   This makes sense -- but I don't see a test for it. I also feel like this 
logic might be duplicated elsewhere(maybe the "can push filters code")? 
   
   What would you think about updating the fields on `ListingTableProvider` 
somehow to reflect this knowledge a bit more centrally?



##########
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##########
@@ -739,7 +739,17 @@ impl ExecutionPlan for ParquetExec {
     }
 
     fn statistics(&self) -> Result<Statistics> {
-        Ok(self.projected_statistics.clone())
+        // When filters are pushed down, we have no way of knowing the exact 
statistics.
+        // Note that pruning predicate is also a kind of filter pushdown.
+        let stats = if self.pruning_predicate.is_some()
+            || self.page_pruning_predicate.is_some()
+            || (self.predicate.is_some() && self.pushdown_filters())
+        {
+            Statistics::new_unknown(&self.schema())

Review Comment:
   Rather than discarding all statistics, what do you think about marking any 
precisions as `Precision::Inexact`?
   
   I am a bit concerned that we would simply discard the statistics entirely in 
the (very common) case of filter pushdown 🤔 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to