appletreeisyellow commented on code in PR #9223:
URL: https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1529036545
##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -326,28 +348,46 @@ pub trait PruningStatistics {
/// LiteralGuarantees are not satisfied
///
/// **Second Pass**: Evaluates the rewritten expression using the
-/// min/max/null_counts values for each column for each container. For any
+/// min/max/null_counts/row_counts values for each column for each container.
For any
/// container that this expression evaluates to `false`, it rules out those
/// containers.
///
-/// For example, given the predicate, `x = 5 AND y = 10`, if we know `x` is
-/// between `1 and 100` and we know that `y` is between `4` and `7`, the input
-/// statistics might look like
+///
+/// ### Example 1
+/// Given the predicate, `x = 5 AND y = 10`, if we know that for a given
container, `x` is
+/// between `1 and 100` and we know that `y` is between `4` and `7`, we know
nothing about
+/// the null count and row count of `x` and `y`, the input statistics might
look like:
///
/// Column | Value
/// -------- | -----
/// `x_min` | `1`
/// `x_max` | `100`
+/// `x_null_count` | `null`
+/// `x_row_count` | `null`
/// `y_min` | `4`
/// `y_max` | `7`
+/// `y_null_count` | `null`
+/// `y_row_count` | `null`
///
/// The rewritten predicate would look like
///
-/// `x_min <= 5 AND 5 <= x_max AND y_min <= 10 AND 10 <= y_max`
+/// ```sql
+/// CASE
+/// WHEN x_null_count = x_row_count THEN false
+/// ELSE x_min <= 5 AND 5 <= x_max
+/// END
+/// AND
+/// CASE
+/// WHEN y_null_count = y_row_count THEN false
+/// ELSE y_min <= 10 AND 10 <= y_max
+/// END
+/// ```
Review Comment:
> I wonder why we need to put x_null_count and x_row_count into rewritten
predicate if we know nothing about them?
That's my bad that the current doc is not clear. The rewritten predicate
should happen first for physical plan, then in the pruning stage, we get
`x_null_count` and `x_row_count` statistics for each container (row group,
file, etc). So no matter whether we know about x_null_count and x_row_count or
not, the rewritten predicate happens first.
Updated in
https://github.com/apache/arrow-datafusion/pull/9223/commits/a8639fb8aceb12333afe99c9fd6275077a017c6e
> Previous rewritten predicate looks more correct/concise to me.
Yes, totally agree. I plan to simplify the case expression and make it easy
to read in a follow-up PR, tracked in
https://github.com/apache/arrow-datafusion/issues/9171#issuecomment-1998518531
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]