adriangb commented on code in PR #14297: URL: https://github.com/apache/datafusion/pull/14297#discussion_r1929591473
########## datafusion/physical-optimizer/src/pruning.rs: ########## @@ -819,16 +813,24 @@ impl RequiredColumns { /// statistics column, while keeping track that a reference to the statistics /// column is required /// - /// for example, an expression like `col("foo") > 5`, when called + /// For example, an expression like `col("foo") > 5`, when called /// with Max would result in an expression like `col("foo_max") > - /// 5` with the appropriate entry noted in self.columns + /// 5` with the appropriate entry noted in self.columns. + /// + /// Along with a reference to the min/max/null_count/row_count column + /// an expression for `{stats_col} IS NULL` is returned that should be `OR`ed + /// with any expressions applied to the column itself such that if any value in `{stats_col}` + /// is `null` we return `true` and not `null` because `null` is falsy, but we'd need to scan that + /// container, causing confusion and preventing optimizations because of the inverted logic! + /// + /// Thus for the example above we'd arrive at the final expression `foo_max is null or foo_max > 5`. fn stat_column_expr( &mut self, column: &phys_expr::Column, column_expr: &Arc<dyn PhysicalExpr>, field: &Field, stat_type: StatisticsType, - ) -> Result<Arc<dyn PhysicalExpr>> { + ) -> Result<(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)> { Review Comment: The idea here is to force callers to do something with the null check expression, or explcitly discard it. This let me update all call sites with confidence of not having missed one, in addition to nicely abstracting away the logic. ########## datafusion/physical-optimizer/src/pruning.rs: ########## @@ -258,9 +258,7 @@ pub trait PruningStatistics { /// /// * Rows that evaluate to `false` are not returned (“filtered out” or “pruned” or “skipped”). /// -/// * Rows that evaluate to `NULL` are **NOT** returned (also “filtered out”). -/// Note: *this treatment of `NULL` is **DIFFERENT** than how `NULL` is treated -/// in the rewritten predicate described below.* +/// * No rows should evaluate to `null`, even if statistics are missing (the statistics are themselves `null`). Review Comment: I have much more docstrings to update, including examples, but I will hold off on that until I get feedback on the rest of the PR. ########## datafusion/physical-optimizer/src/pruning.rs: ########## @@ -2785,7 +2893,7 @@ mod tests { vec![lit(1), lit(2), lit(3)], false, )); - let expected_expr = "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 1 AND 1 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 2 AND 2 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 3 AND 3 <= c1_max@1"; + let expected_expr = "(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 1) AND (c1_max@1 IS NULL OR 1 <= c1_max@1) OR (c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 2) AND (c1_max@1 IS NULL OR 2 <= c1_max@1) OR (c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 3) AND (c1_max@1 IS NULL OR 3 <= c1_max@1)"; Review Comment: These expressions are long and gross... I wonder if we shouldn't do some combination of formatting the sql and using insta for snapshotting? Or can we serde dump the expressions and pretty format the json? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org