alamb commented on code in PR #10051:
URL: https://github.com/apache/arrow-datafusion/pull/10051#discussion_r1562854564
##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -335,7 +335,7 @@ pub trait PruningStatistics {
/// `x < 5` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END`
/// `x = 5 AND y = 10` | `CASE WHEN x_null_count = x_row_count THEN false ELSE x_min <= 5 AND 5 <= x_max END AND CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END`
/// `x IS NULL` | `x_null_count > 0`
-/// `x IS NOT NULL` | `x_null_count = 0`
+/// `x IS NOT NULL` | `!(x_null_count = row_count)`
Review Comment:
What do you think about using `!=` instead? It would make the expression simpler and likely be (very slightly) faster.
```suggestion
/// `x IS NOT NULL` | `x_null_count != row_count`
```
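For intuition, here is a minimal sketch (with hypothetical helper names, not the DataFusion implementation) showing that the `!=` form makes the same pruning decisions as the negated equality:

```rust
// For `x IS NOT NULL`, a container can be pruned only when every row is null,
// i.e. when x_null_count == x_row_count. `!(a == b)` and `a != b` are
// equivalent, so the simpler form yields identical keep/prune decisions.
fn keep_container_negated(null_count: u64, row_count: u64) -> bool {
    !(null_count == row_count) // `!(x_null_count = row_count)` form
}

fn keep_container_ne(null_count: u64, row_count: u64) -> bool {
    null_count != row_count // `x_null_count != row_count` form
}

fn main() {
    // all-null container: pruned (false) by both forms
    assert!(!keep_container_negated(10, 10));
    assert!(!keep_container_ne(10, 10));
    // partially-null container: kept (true) by both forms
    assert!(keep_container_negated(3, 10));
    assert!(keep_container_ne(3, 10));
    println!("ok");
}
```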
##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -824,9 +863,17 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> {
}
Scenario::WithNullValues => {
Review Comment:
Could you please also add the "some null/some not null" null values case to this scenario as well?
```
make_int_batches_with_null(5, 1, 6),
```
##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -632,22 +633,60 @@ fn make_names_batch(name: &str, service_name_values: Vec<&str>) -> RecordBatch {
RecordBatch::try_new(schema, vec![Arc::new(name), Arc::new(service_name)]).unwrap()
}
-/// Return record batch with i8, i16, i32, and i64 sequences with all Null values
-fn make_all_null_values() -> RecordBatch {
+/// Return record batch with i8, i16, i32, and i64 sequences with Null values
Review Comment:
I think it would help to explain the shape of this data a bit more to help readers. Something like the following (though note it would change if you change the arguments as I suggest below)
```suggestion
/// Return record batch with i8, i16, i32, and i64 sequences with Null values.
///
/// The first `null_values` values are null and the following
/// `no_null_values_end - no_null_values_start` values are non null.
```
##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -1241,9 +1241,11 @@ fn build_single_column_expr(
/// if the column may contain null, and false if definitely does not
/// contain null.
/// If set `with_not` to true: which means is not null
+/// because datafusion use false flag of expr result to prune unit (row group, page ..)
/// Given an expression reference to `expr`, if `expr` is a column expression,
/// returns a pruning expression in terms of IsNotNull that will evaluate to true
-/// if the column not contain any null, and false if definitely contain null.
+/// if the column may contain any non-null values, and false if definitely does not contain
+/// non-null values null as all null values.
Review Comment:
I always find this logic very confusing. Here is a proposed wording change that maybe makes it clearer. If you think this is reasonable, perhaps you can make the same changes to the start of the comment
```suggestion
/// If `with_not` is true, build a pruning expression for `col IS NOT NULL`: `col_count != col_null_count`
/// The pruning expression evaluates to false ONLY if the column definitely contains
/// ONLY NULL values. In this case we know that `IS NOT NULL` can not be true and
/// thus can prune the row group / value
```
##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
}
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where i8 <= 6",
+ Some(0),
+ // expect prune two pages which 10 rows
Review Comment:
I went through the tests carefully and made notes for myself of which row groups were being pruned. I figured I would suggest adding them as comments as well
```suggestion
        // expect prune pages with all null or pages that have only values greater than 6
        // (row_group1, page2), (row_group4, page2)
```
##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -632,22 +633,60 @@ fn make_names_batch(name: &str, service_name_values: Vec<&str>) -> RecordBatch {
RecordBatch::try_new(schema, vec![Arc::new(name), Arc::new(service_name)]).unwrap()
}
-/// Return record batch with i8, i16, i32, and i64 sequences with all Null values
-fn make_all_null_values() -> RecordBatch {
+/// Return record batch with i8, i16, i32, and i64 sequences with Null values
+/// here 5 rows in page when using Unit::Page
+fn make_int_batches_with_null(
+ null_values: usize,
+ no_null_values_start: usize,
+ no_null_values_end: usize,
Review Comment:
I think you might be able to simplify this method by sending in 2
parameters. Right now it looks like it interleaves nulls arbitrarily but really
the nulls are always at the start and non nulls are at the end
```rust
num_nulls: usize,
non_nulls: usize
```
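As a sketch of the suggested two-parameter shape (hypothetical names; the real helper builds an arrow `RecordBatch` with i8/i16/i32/i64 columns, modeled here with a plain `Vec<Option<i64>>` to stay self-contained):

```rust
// Hypothetical simplified helper: nulls always come first, then the
// non-null sequence 1..=non_nulls. One Vec<Option<i64>> stands in for
// a single column of the RecordBatch.
fn make_column_with_null(num_nulls: usize, non_nulls: usize) -> Vec<Option<i64>> {
    let mut v: Vec<Option<i64>> = vec![None; num_nulls];
    v.extend((1..=non_nulls as i64).map(Some));
    v
}

fn main() {
    // 5 nulls followed by values 1..=5
    let col = make_column_with_null(5, 5);
    assert_eq!(col.len(), 10);
    assert!(col[..5].iter().all(|v| v.is_none()));
    assert_eq!(col[5], Some(1));
    assert_eq!(col[9], Some(5));
    println!("ok");
}
```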
##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -824,9 +863,17 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> {
}
Scenario::WithNullValues => {
vec![
- make_all_null_values(),
+ make_int_batches_with_null(5, 0, 0),
make_int_batches(1, 6),
- make_all_null_values(),
+ make_int_batches_with_null(5, 0, 0),
+ ]
+ }
+ Scenario::WithNullValuesPageLevel => {
+ vec![
+ make_int_batches_with_null(5, 1, 6),
Review Comment:
I think this scenario should have at least one batch of all nulls (e.g. `make_int_batches_with_null(5, 0, 0)`)
##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
}
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where i8 <= 6",
+ Some(0),
+ // expect prune two pages which 10 rows
+ Some(10),
+ 22,
+ )
+ .await;
+
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where \"i16\" is not null",
+ Some(0),
+ // expect prune two pages which 10 rows
+ Some(10),
+ 29,
+ )
+ .await;
+
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where \"i32\" is null",
+ Some(0),
+ // expect prune 5 pages which 25 rows
Review Comment:
```suggestion
        // expect prune (row_group1, page1), (row_group2, page1+2), (row_group3, page1), (row_group4, page1) = 25 rows
```
##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
}
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+ test_prune(
Review Comment:
It took me a while to convince myself that this was actually setting up the scenario as described. I eventually found it here:
https://github.com/alamb/arrow-datafusion/blob/dee926519030301f052dc2c3196e4fbef0da4c47/datafusion/core/tests/parquet/mod.rs#L916-L920
I wonder if it would be possible to add some better comments to `test_prune` mentioning that the created parquet files have 5 rows per page
##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
}
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where i8 <= 6",
+ Some(0),
+ // expect prune two pages which 10 rows
+ Some(10),
+ 22,
+ )
+ .await;
+
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where \"i16\" is not null",
+ Some(0),
+ // expect prune two pages which 10 rows
+ Some(10),
+ 29,
+ )
+ .await;
+
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where \"i32\" is null",
+ Some(0),
+ // expect prune 5 pages which 25 rows
+ Some(25),
+ 11,
+ )
+ .await;
+
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where \"i64\" > 6",
+ Some(0),
+ // 6 pages will be pruned which 30 rows
Review Comment:
```suggestion
        // expect to prune pages where i64 is all null, or where the values are always <= 5
        // (row_group1, page1+2), (row_group2, page1), (row_group3, page1), (row_group4, page1+2) = 30 rows
```
##########
datafusion/core/tests/parquet/page_pruning.rs:
##########
@@ -871,6 +871,55 @@ async fn without_pushdown_filter() {
assert!(bytes_scanned_with_filter > bytes_scanned_without_filter);
}
+#[tokio::test]
+// Data layout like this:
+// row_group1: page1(1~5), page2(All Null)
+// row_group2: page1(1~5), page2(6~10)
+// row_group3: page1(1~5), page2(6~9 + Null)
+// row_group4: page1(1~5), page2(All Null)
+// total 40 rows
+async fn test_pages_with_null_values() {
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where i8 <= 6",
+ Some(0),
+ // expect prune two pages which 10 rows
+ Some(10),
+ 22,
+ )
+ .await;
+
+ test_prune(
+ Scenario::WithNullValuesPageLevel,
+ "SELECT * FROM t where \"i16\" is not null",
+ Some(0),
+ // expect prune two pages which 10 rows
Review Comment:
```suggestion
// expect prune (row_group1, page2) and (row_group4, page2) = 10 rows
```
##########
datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs:
##########
@@ -548,7 +551,20 @@ impl<'a> PruningStatistics for PagesPruningStatistics<'a> {
}
fn row_counts(&self, _column: &datafusion_common::Column) -> Option<ArrayRef> {
- None
+ // see https://github.com/apache/arrow-rs/blob/91f0b1771308609ca27db0fb1d2d49571b3980d8/parquet/src/file/metadata.rs#L979-L982
+ let mut first_row_index = self
+ .col_offset_indexes
Review Comment:
I think you can do this same calculation without allocating intermediate vecs -- something like this:
```rust
let row_count_per_page = self
    .col_offset_indexes
    .windows(2)
    .map(|location| Some(location[1].first_row_index - location[0].first_row_index));
Some(Arc::new(Int64Array::from_iter(row_count_per_page)))
```
🤔 the name `col_offset_indexes` is somewhat confusing to me as they are `PageLocation`s --
```
col_offset_indexes: &'a Vec<PageLocation>,
```
maybe we could rename that field to `page_locations` 🤔
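A self-contained sketch of the `windows(2)` calculation above (with a stand-in `PageLocation` struct, since the real type lives in the parquet crate, and a plain `Vec<Option<i64>>` modeling arrow's `Int64Array`):

```rust
// Stand-in for parquet's PageLocation; only the field used here is modeled.
struct PageLocation {
    first_row_index: i64,
}

fn row_counts_per_page(page_locations: &[PageLocation]) -> Vec<Option<i64>> {
    // each page's row count is the gap between consecutive first_row_index
    // values; note windows(2) yields one entry fewer than the number of
    // pages, so the final page's count would still need the total row count
    page_locations
        .windows(2)
        .map(|loc| Some(loc[1].first_row_index - loc[0].first_row_index))
        .collect()
}

fn main() {
    // three pages starting at rows 0, 5, and 10: two 5-row gaps
    let pages = vec![
        PageLocation { first_row_index: 0 },
        PageLocation { first_row_index: 5 },
        PageLocation { first_row_index: 10 },
    ];
    assert_eq!(row_counts_per_page(&pages), vec![Some(5), Some(5)]);
    println!("ok");
}
```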
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]