Re: [PR] Fix join order for TPCH Q17 & Q18 by improving FilterExec statistics [arrow-datafusion]

via GitHub Mon, 13 Nov 2023 06:50:59 -0800


alamb commented on code in PR #8126:
URL: https://github.com/apache/arrow-datafusion/pull/8126#discussion_r1391219185



##########
datafusion/physical-plan/src/filter.rs:
##########
@@ -194,11 +194,13 @@ impl ExecutionPlan for FilterExec {
     fn statistics(&self) -> Result<Statistics> {
         let predicate = self.predicate();
 
+        let input_stats = self.input.statistics()?;
         let schema = self.schema();
         if !check_support(predicate, &schema) {
-            return Ok(Statistics::new_unknown(&schema));
+            // assume worst case, that the filter is highly selective and
+            // returns all the rows from its input
+            return Ok(input_stats.clone().into_inexact());

Review Comment:
   Default selectivities / cost estimates work OK for TPCH queries, where the 
data is relatively uniformly distributed. 
   
   However, in my experience they tend to cause problems when the data is 
skewed or the columns are correlated.
   
   Hopefully we'll be able to keep the number of hard-coded constants / 
assumptions low in DataFusion (so there are fewer things for the optimizer to 
get wrong :) )
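
   To make the concern concrete, here is a minimal, self-contained sketch (not 
DataFusion code). It compares a fixed default selectivity against the real row 
count for the same equality filter on a uniform column and on a skewed one. The 
constant `DEFAULT_SELECTIVITY` (0.2) and all helper names are hypothetical and 
chosen only for illustration; they are not taken from this PR.

```rust
/// Hypothetical fixed selectivity, assumed for illustration only.
const DEFAULT_SELECTIVITY: f64 = 0.2;

/// Estimate a filter's output rows using nothing but the default selectivity.
fn estimate_rows(input_rows: usize) -> usize {
    (input_rows as f64 * DEFAULT_SELECTIVITY).round() as usize
}

/// Count the rows that actually satisfy the predicate `value == target`.
fn actual_rows(values: &[u32], target: u32) -> usize {
    values.iter().filter(|&&v| v == target).count()
}

/// Uniform column: the values 0..5 repeat in a cycle.
fn uniform_column(n: usize) -> Vec<u32> {
    (0..n).map(|i| (i % 5) as u32).collect()
}

/// Skewed column: 19 out of every 20 rows hold the value 0,
/// the remaining rows cycle through 1..=4.
fn skewed_column(n: usize) -> Vec<u32> {
    (0..n)
        .map(|i| {
            if i % 20 == 0 {
                1 + ((i / 20) % 4) as u32
            } else {
                0
            }
        })
        .collect()
}

fn main() {
    let n = 1_000;
    let estimate = estimate_rows(n);
    let uniform = uniform_column(n);
    let skewed = skewed_column(n);

    println!("estimated rows (any column): {estimate}");
    println!("actual rows, uniform, v == 0: {}", actual_rows(&uniform, 0));
    println!("actual rows, skewed,  v == 0: {}", actual_rows(&skewed, 0));
    println!("actual rows, skewed,  v == 1: {}", actual_rows(&skewed, 1));
}
```

   In this toy setup the fixed estimate happens to match the uniform column, but 
on the skewed column the same estimate is far too low for the common value and 
far too high for a rare one. Errors of that kind are what can lead a join 
ordering decision astray.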



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

