Re: [PR] Support "A column is known to be entirely NULL" in `PruningPredicate` [arrow-datafusion]

via GitHub Mon, 18 Mar 2024 11:15:42 -0700


appletreeisyellow commented on code in PR #9223:
URL: https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1529036545



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -326,28 +348,46 @@ pub trait PruningStatistics {
 /// LiteralGuarantees are not satisfied
 ///
 /// **Second Pass**: Evaluates the rewritten expression using the
-/// min/max/null_counts values for each column for each container. For any
+/// min/max/null_counts/row_counts values for each column for each container. 
For any
 /// container that this expression evaluates to `false`, it rules out those
 /// containers.
 ///
-/// For example, given the predicate, `x = 5 AND y = 10`, if we know `x` is
-/// between `1 and 100` and we know that `y` is between `4` and `7`, the input
-/// statistics might look like
+///
+/// ### Example 1
+/// Given the predicate, `x = 5 AND y = 10`, if we know that for a given 
container, `x` is
+/// between `1 and 100` and we know that `y` is between `4` and `7`, we know 
nothing about
+/// the null count and row count of `x` and `y`, the input statistics might 
look like:
 ///
 /// Column   | Value
 /// -------- | -----
 /// `x_min`  | `1`
 /// `x_max`  | `100`
+/// `x_null_count` | `null`
+/// `x_row_count`  | `null`
 /// `y_min`  | `4`
 /// `y_max`  | `7`
+/// `y_null_count` | `null`
+/// `y_row_count`  | `null`
 ///
 /// The rewritten predicate would look like
 ///
-/// `x_min <= 5 AND 5 <= x_max AND  y_min <= 10 AND 10 <= y_max`
+/// ```sql
+/// CASE
+///     WHEN x_null_count = x_row_count THEN false
+///     ELSE x_min <= 5 AND 5 <= x_max
+/// END
+/// AND
+/// CASE
+///     WHEN y_null_count = y_row_count THEN false
+///     ELSE y_min <= 10 AND 10 <= y_max
+/// END
+/// ```

Review Comment:
   > I wonder why we need to put x_null_count and x_row_count into rewritten 
predicate if we know nothing about them?
   
   That's my bad that the current doc is not clear. The rewritten predicate 
should happen first for physical plan, then in the pruning stage, we get 
`x_null_count` and `x_row_count` statistics for each container (row group, 
file, etc). So no matter whether we know about x_null_count and x_row_count or 
not, the rewritten predicate happens first.
   
   Updated in 
https://github.com/apache/arrow-datafusion/pull/9223/commits/a8639fb8aceb12333afe99c9fd6275077a017c6e
   
   > Previous rewritten predicate looks more correct/concise to me.
   
   Yes, totally agree. I plan to simplify the case expression and make it easy 
to read in a follow-up PR, tracked in 
https://github.com/apache/arrow-datafusion/issues/9171#issuecomment-1998518531



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Support "A column is known to be entirely NULL" in `PruningPredicate` [arrow-datafusion]

Reply via email to