alamb commented on code in PR #10051:
URL: https://github.com/apache/arrow-datafusion/pull/10051#discussion_r1562854564


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -335,7 +335,7 @@ pub trait PruningStatistics {
 /// `x < 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END`
 /// `x = 5 AND y = 10` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END AND CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END`
 /// `x IS NULL`  | `x_null_count > 0`
-/// `x IS NOT NULL`  | `x_null_count = 0`
+/// `x IS NOT NULL`  | `!(x_null_count = row_count)`

Review Comment:
   What do you think about using `!=` instead? It would make the expression simpler and likely be (very slightly) faster.
   
   ```suggestion
   /// `x IS NOT NULL`  | `x_null_count != row_count`
   ```
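
   To make the comparison concrete, here is a small illustrative sketch using DataFusion's `Expr` builders (not part of the PR): `!(a = b)` evaluates the equality and then negates the result, while `a != b` is a single comparison.

   ```rust
   use datafusion::prelude::*;

   fn main() {
       // `!(x_null_count = x_row_count)`: an equality check followed by a negation
       let negated = col("x_null_count").eq(col("x_row_count")).not();
       // `x_null_count != x_row_count`: a single comparison
       let direct = col("x_null_count").not_eq(col("x_row_count"));
       println!("{negated} vs {direct}");
   }
   ```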



##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -824,9 +863,17 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> {
         }
         Scenario::WithNullValues => {

Review Comment:
   Could you please also add the "some null/some not null" null values case to 
this scenario as well? 
   ```
                   make_int_batches_with_null(5, 1, 6),
   ```



##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -632,22 +633,60 @@ fn make_names_batch(name: &str, service_name_values: Vec<&str>) -> RecordBatch {
     RecordBatch::try_new(schema, vec![Arc::new(name), Arc::new(service_name)]).unwrap()
 }
 
-/// Return record batch with i8, i16, i32, and i64 sequences with all Null values
-fn make_all_null_values() -> RecordBatch {
+/// Return record batch with i8, i16, i32, and i64 sequences with Null values

Review Comment:
   I think it would help to explain the shape of this data a bit more for readers. Something like the following (though note it would change if you change the arguments as I suggest below):
   
   ```suggestion
   /// Return record batch with i8, i16, i32, and i64 sequences with Null values.
   ///
   /// The first `null_values` values are null and the following 
   /// `no_null_values_end - no_null_values_start` values are non null.
   ```



##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -1241,9 +1241,11 @@ fn build_single_column_expr(
 /// if the column may contain null, and false if definitely does not
 /// contain null.
 /// If set `with_not` to true: which means is not null
+/// because DataFusion uses a `false` pruning-expression result to prune a unit (row group, page, ...)
 /// Given an expression reference to `expr`, if `expr` is a column expression,
 /// returns a pruning expression in terms of IsNotNull that will evaluate to true
-/// if the column not contain any null, and false if definitely contain null.
+/// if the column may contain any non-null values, and false if it definitely does
+/// not contain any non-null values (i.e., all values are null).

Review Comment:
   I always find this logic very confusing. Here is a proposed wording change that may make it clearer:
   
   If you think this is reasonable, perhaps you can make the same changes to 
the start of the comment
   
   ```suggestion
   /// If `with_not` is true, build a pruning expression for `col IS NOT NULL`: `col_count != col_null_count`.
   /// The pruning expression evaluates to false ONLY if the column is definitely all NULL
   /// (contains no non-null values). In that case we know `IS NOT NULL` can not be true
   /// for any row and thus can prune the row group / page.
   ```
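
   A quick worked example of those semantics (the numbers are invented for illustration):

   ```rust
   // Container with 100 rows, all of them null:
   //   col_count != col_null_count  ->  100 != 100  ->  false
   //   `col IS NOT NULL` can match nothing, so the container is pruned.
   //
   // Container with 100 rows, 40 of them null:
   //   col_count != col_null_count  ->  100 != 40  ->  true
   //   the container may contain matching rows, so it must be kept.
   ```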



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
     assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
 }
 
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where i8 <= 6",
+        Some(0),
+        // expect prune two pages which 10 rows

Review Comment:
   I went through the tests carefully and made notes for myself of which row 
groups were being pruned. I figured I would suggest adding them as comments as 
well
   
   ```suggestion
        // expect to prune pages that are all null or pages that have only values greater than 6
           // (row_group1, page2), (row_group4, page2)
   ```



##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -632,22 +633,60 @@ fn make_names_batch(name: &str, service_name_values: Vec<&str>) -> RecordBatch {
     RecordBatch::try_new(schema, vec![Arc::new(name), Arc::new(service_name)]).unwrap()
 }
 
-/// Return record batch with i8, i16, i32, and i64 sequences with all Null values
-fn make_all_null_values() -> RecordBatch {
+/// Return record batch with i8, i16, i32, and i64 sequences with Null values
+/// here 5 rows in page when using Unit::Page
+fn make_int_batches_with_null(
+    null_values: usize,
+    no_null_values_start: usize,
+    no_null_values_end: usize,

Review Comment:
   I think you might be able to simplify this method by sending in 2 
parameters. Right now it looks like it interleaves nulls arbitrarily but really 
the nulls are always at the start and non nulls are at the end
   
   ```rust
     num_nulls: usize,
     non_nulls: usize
   ```
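
   For illustration, here is a minimal sketch of what the two-parameter version might look like (hypothetical: the column names and the sequential-values convention are assumed from the surrounding test helpers):

   ```rust
   use std::sync::Arc;

   use arrow::array::{Int16Array, Int32Array, Int64Array, Int8Array};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;

   /// Hypothetical simplified helper: `num_nulls` nulls followed by
   /// `non_nulls` sequential non-null values (1, 2, ...) in each column.
   fn make_int_batches_with_null(num_nulls: usize, non_nulls: usize) -> RecordBatch {
       let schema = Arc::new(Schema::new(vec![
           Field::new("i8", DataType::Int8, true),
           Field::new("i16", DataType::Int16, true),
           Field::new("i32", DataType::Int32, true),
           Field::new("i64", DataType::Int64, true),
       ]));

       // Each column is [None; num_nulls] ++ [Some(1), ..., Some(non_nulls)];
       // values are assumed small enough to fit in i8.
       macro_rules! column {
           ($ty:ty, $array:ty) => {
               Arc::new(
                   std::iter::repeat(None)
                       .take(num_nulls)
                       .chain((1..=non_nulls as $ty).map(Some))
                       .collect::<$array>(),
               )
           };
       }

       RecordBatch::try_new(
           schema,
           vec![
               column!(i8, Int8Array),
               column!(i16, Int16Array),
               column!(i32, Int32Array),
               column!(i64, Int64Array),
           ],
       )
       .unwrap()
   }
   ```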



##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -824,9 +863,17 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> {
         }
         Scenario::WithNullValues => {
             vec![
-                make_all_null_values(),
+                make_int_batches_with_null(5, 0, 0),
                 make_int_batches(1, 6),
-                make_all_null_values(),
+                make_int_batches_with_null(5, 0, 0),
+            ]
+        }
+        Scenario::WithNullValuesPageLevel => {
+            vec![
+                make_int_batches_with_null(5, 1, 6),

Review Comment:
   I think this scenario should have at least one batch of all nulls (e.g. 
`make_int_batches_with_nulls(5, 0, 0)`)



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
     assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
 }
 
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where i8 <= 6",
+        Some(0),
+        // expect prune two pages which 10 rows
+        Some(10),
+        22,
+    )
+    .await;
+
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where \"i16\" is not null",
+        Some(0),
+        // expect prune two pages which 10 rows
+        Some(10),
+        29,
+    )
+    .await;
+
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where \"i32\" is null",
+        Some(0),
+        // expect prune 5 pages which 25 rows

Review Comment:
   ```suggestion
        // expect prune (row_group1, page1), (row_group2, page1+2), (row_group3, page1), (row_group4, page1) = 25 rows
   ```



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
     assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
 }
 
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+    test_prune(

Review Comment:
   It took me a while to convince myself that this was actually setting up the 
scenario as described. I eventually found it here:
   
   https://github.com/alamb/arrow-datafusion/blob/dee926519030301f052dc2c3196e4fbef0da4c47/datafusion/core/tests/parquet/mod.rs#L916-L920
   
   I wonder if it would be possible to add some better comments to `test_prune` mentioning that the created parquet files have 5 rows per page.
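
   For illustration, here is a hedged sketch of the writer configuration such a comment could point at, assuming the test uses parquet's `WriterProperties` to cap page size (the helper name is hypothetical):

   ```rust
   use parquet::file::properties::WriterProperties;

   /// Hypothetical helper (not in the PR): writer settings that yield the
   /// 5-rows-per-page layout the page pruning tests rely on, so a 10-row
   /// batch is written as two pages.
   fn page_layout_props() -> WriterProperties {
       WriterProperties::builder()
           // Flush a data page after every 5 rows...
           .set_data_page_row_count_limit(5)
           // ...and check the limit every 5 rows, since it is only
           // enforced between write batches.
           .set_write_batch_size(5)
           .build()
   }
   ```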



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
     assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
 }
 
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where i8 <= 6",
+        Some(0),
+        // expect prune two pages which 10 rows
+        Some(10),
+        22,
+    )
+    .await;
+
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where \"i16\" is not null",
+        Some(0),
+        // expect prune two pages which 10 rows
+        Some(10),
+        29,
+    )
+    .await;
+
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where \"i32\" is null",
+        Some(0),
+        // expect prune 5 pages which 25 rows
+        Some(25),
+        11,
+    )
+    .await;
+
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where \"i64\" > 6",
+        Some(0),
+        // 6 pages will be pruned which 30 rows

Review Comment:
   ```suggestion
        // expect to prune pages where i64 is all null, or where values are always <= 5
        // (row_group1, page1+2), (row_group2, page1), (row_group3, page1), (row_group4, page1+2) = 30 rows
   ```



##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
     assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
 }
 
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where i8 <= 6",
+        Some(0),
+        // expect prune two pages which 10 rows
+        Some(10),
+        22,
+    )
+    .await;
+
+    test_prune(
+        Scenario::WithNullValuesPageLevel,
+        "SELECT * FROM t where \"i16\" is not null",
+        Some(0),
+        // expect prune two pages which 10 rows

Review Comment:
   ```suggestion
        // expect to prune (row_group1, page2) and (row_group4, page2) = 10 rows
   ```



##########
datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs:
##########
@@ -548,7 +551,20 @@ impl<'a> PruningStatistics for PagesPruningStatistics<'a> {
     }
 
     fn row_counts(&self, _column: &datafusion_common::Column) -> Option<ArrayRef> {
-        None
+        // see https://github.com/apache/arrow-rs/blob/91f0b1771308609ca27db0fb1d2d49571b3980d8/parquet/src/file/metadata.rs#L979-L982
+        let mut first_row_index = self
+            .col_offset_indexes

Review Comment:
   I think you can do this same calculation without allocating intermediate vec 
s-- something like this:
   
   ```rust
           let row_count_per_page = self
               .col_offset_indexes
               .windows(2)
            .map(|location| Some(location[1].first_row_index - location[0].first_row_index));
   
           Some(Arc::new(Int64Array::from_iter(row_count_per_page)))
   ```
   
   
   
   🤔  the name `col_offset_indexes` is somewhat confusing to me as they are 
`PageLocation`s -- 
   
   ```
       col_offset_indexes: &'a Vec<PageLocation>,
   ```
   
   maybe we could rename that field to `page_locations` 🤔 
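
   For reference, a self-contained sketch of the `windows(2)` approach (names are illustrative): the offset index only records where each page starts, so the final page's count has to come from the row group's total row count, as in the arrow-rs code linked above.

   ```rust
   use std::sync::Arc;

   use arrow::array::{ArrayRef, Int64Array};
   use parquet::format::PageLocation;

   /// Illustrative helper: per-page row counts derived from the offset index.
   /// `total_rows` is the row group's row count, needed for the last page.
   fn page_row_counts(page_locations: &[PageLocation], total_rows: i64) -> Option<ArrayRef> {
       let last = page_locations.last()?;
       let counts = page_locations
           .windows(2)
           .map(|w| w[1].first_row_index - w[0].first_row_index)
           // The offset index has no "end" entry, so the last page's count
           // is the total row count minus that page's starting row.
           .chain(std::iter::once(total_rows - last.first_row_index));
       Some(Arc::new(Int64Array::from_iter_values(counts)))
   }
   ```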


