alamb commented on code in PR #9223:
URL: https://github.com/apache/arrow-datafusion/pull/9223#discussion_r1555021735


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -306,17 +315,23 @@ pub trait PruningStatistics {
 /// * `false`: there are no rows that could possibly match the predicate,
 ///            **PRUNES** the container
 ///
-/// For example, given a column `x`, the `x_min` and `x_max` and `x_null_count`
-/// represent the minimum and maximum values, and the null count of column `x`,
-/// provided by the `PruningStatistics`. Here are some examples of the rewritten
-/// predicates:
+/// For example, given a column `x`, the `x_min`, `x_max`, `x_null_count`, and
+/// `x_row_count` represent the minimum and maximum values, the null count of
+/// column `x`, and the row count of column `x`, provided by the `PruningStatistics`.
+/// `x_null_count` and `x_row_count` are used to handle the case where the column `x`
+/// is known to be all `NULL`s. Note this is different from knowing nothing about
+/// the column `x`, which confusingly is encoded by returning `NULL` for the min/max
+/// values from [`PruningStatistics::max_values`] and [`PruningStatistics::min_values`].
+///
+/// Here are some examples of the rewritten predicates:
 ///
 /// Original Predicate | Rewritten Predicate
 /// ------------------ | --------------------
-/// `x = 5` | `x_min <= 5 AND 5 <= x_max`
-/// `x < 5` | `x_max < 5`
-/// `x = 5 AND y = 10` | `x_min <= 5 AND 5 <= x_max AND y_min <= 10 AND 10 <= y_max`
-/// `x IS NULL`  | `x_null_count > 0`
+/// `x = 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END`
+/// `x < 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END`
+/// `x = 5 AND y = 10` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END AND CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END`
+/// `x IS NULL`  | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_null_count > 0 END`

Review Comment:
   I think you are right @Ted-Jiang. Nicely spotted -- I double checked and
indeed the only rewrite done for `IS NULL` is `null_count > 0`. I'll make a PR to fix the docs.
   
   ```
   ❯ explain select duration_nano from traces where duration_nano IS NULL;
   
   |               |     ParquetExec: file_groups={16 groups: [...]}, projection=[duration_nano], predicate=duration_nano@1 IS NULL, pruning_predicate=duration_nano_null_count@0 > 0, required_guarantees=[] |
   ```
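
To make the difference concrete, here is a minimal sketch of the two rewrites evaluated against per-container statistics. It is an illustration only, not DataFusion's implementation: `ColumnStats`, `all_null`, `keep_for_eq`, and `keep_for_is_null` are hypothetical names, and the real code works on arrays returned by `PruningStatistics` rather than on a single struct. `None` stands in for an unknown (`NULL`) statistic, and an unknown result is treated conservatively as "keep the container".

```rust
/// Hypothetical per-container statistics for one column `x`.
/// `None` models a statistic that is unknown (returned as NULL).
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: Option<u64>,
    row_count: Option<u64>,
}

/// True only when the statistics prove that every row is NULL,
/// i.e. `x_null_count = x_row_count` with both counts known.
fn all_null(s: &ColumnStats) -> bool {
    matches!((s.null_count, s.row_count), (Some(n), Some(r)) if n == r)
}

/// `x = v`, rewritten as
/// `CASE WHEN x_null_count = x_row_count THEN false
///       ELSE x_min <= v AND v <= x_max END`.
/// Returns `true` to keep the container and `false` to prune it.
fn keep_for_eq(s: &ColumnStats, v: i64) -> bool {
    if all_null(s) {
        // No non-NULL value exists, so `x = v` can never be true.
        return false;
    }
    // An unknown bound cannot rule the container out, so it counts as `true`.
    let min_ok = s.min.map_or(true, |min| min <= v);
    let max_ok = s.max.map_or(true, |max| v <= max);
    min_ok && max_ok
}

/// `x IS NULL`, rewritten as plain `x_null_count > 0` with no CASE wrapper.
/// Wrapping it in the all-NULL check would prune exactly the containers
/// in which every row matches the predicate.
fn keep_for_is_null(s: &ColumnStats) -> bool {
    s.null_count.map_or(true, |n| n > 0)
}
```

With these definitions, a container whose 10 rows are all `NULL` is pruned for `x = 5` but kept for `x IS NULL`, which matches the `pruning_predicate` shown in the plan above.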



