Re: [PR] parquet: Make page_index/pushdown metrics consistent with row_group metrics [datafusion]

via GitHub Fri, 20 Sep 2024 07:42:45 -0700


alamb commented on code in PR #12545:
URL: https://github.com/apache/datafusion/pull/12545#discussion_r1768748882



##########
datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs:
##########
@@ -276,6 +281,14 @@ fn rows_skipped(selection: &RowSelection) -> usize {
         .fold(0, |acc, x| if x.skip { acc + x.row_count } else { acc })
 }
 
+/// returns the number of rows not skipped in the selection
+/// TODO should this be upstreamed to RowSelection?

Review Comment:
   This looks the same as `RowSelection::row_count`:
   https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowSelection.html#method.row_count
   
   It would be great to upstream this and `rows_skipped` to `parquet`. Any chance you are willing to file a ticket to do so?
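
For illustration, here is a minimal sketch of the two counts side by side. It assumes a local stand-in `RowSelector` struct (redefined here so the snippet compiles without the `parquet` crate) with the same `row_count` and `skip` fields used in the diff; `rows_selected` and `example_selection` are hypothetical names, not functions from the PR or from `parquet`.

```rust
/// Hypothetical stand-in for `parquet::arrow::arrow_reader::RowSelector`,
/// redefined locally so this sketch builds without the `parquet` crate.
#[derive(Debug, Clone, Copy)]
pub struct RowSelector {
    /// Number of rows covered by this selector.
    pub row_count: usize,
    /// `true` if the rows are skipped, `false` if they are selected.
    pub skip: bool,
}

/// Number of rows the selection skips (equivalent to the fold in `rows_skipped`).
pub fn rows_skipped(selectors: &[RowSelector]) -> usize {
    selectors.iter().filter(|s| s.skip).map(|s| s.row_count).sum()
}

/// Number of rows the selection keeps (hypothetical name; this is the
/// count that appears to overlap with `RowSelection::row_count`).
pub fn rows_selected(selectors: &[RowSelector]) -> usize {
    selectors.iter().filter(|s| !s.skip).map(|s| s.row_count).sum()
}

/// Example selection: select 100 rows, skip 30, select 20.
pub fn example_selection() -> Vec<RowSelector> {
    vec![
        RowSelector { row_count: 100, skip: false },
        RowSelector { row_count: 30, skip: true },
        RowSelector { row_count: 20, skip: false },
    ]
}
```

Under these assumptions, the skipped and selected counts always add up to the total number of rows the selection covers, which is the property the matched/pruned metric pair relies on.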



##########
docs/source/user-guide/explain-usage.md:
##########
@@ -223,6 +223,21 @@ Again, reading from bottom up:
 - `SortPreservingMergeExec`
   - `output_rows=5`, `elapsed_compute=2.375µs`: Produced the final 5 rows in 2.375µs (microseconds)
 
+When predicate pushdown is enabled, `ParquetExec` gains the following metrics:
+
+- `page_index_rows_matched`: number of rows in pages that were tested by a page index filter, and passed
+- `page_index_rows_pruned`: number of rows in pages that were tested by a page index filter, and did not pass
+- `row_groups_matched_bloom_filter`: number of rows in row groups that were tested by a Bloom Filter, and passed
+- `row_groups_pruned_bloom_filter`: number of rows in row groups that were tested by a Bloom Filter, and did not pass
+- `row_groups_matched_statistics`: number of rows in row groups that were tested by row group statistics (min and max value), and passed
+- `row_groups_pruned_statistics`: number of rows in row groups that were tested by row group statistics (min and max value), and did not pass
+- `pushdown_rows_matched`: rows that were tested by any of the above filtered, and passed all of them (this should be minimum of `page_index_rows_matched`, `row_groups_pruned_bloom_filter`, and `row_groups_pruned_statistics`)
+- `pushdown_rows_pruned`: rows that were tested by any of the above filtered, and did not pass one of them (this should be sum of `page_index_rows_matched`, `row_groups_pruned_bloom_filter`, and `row_groups_pruned_statistics`)
+- `predicate_evaluation_errors`
+- `num_predicate_creation_errors`
+- `pushdown_eval_time`: time spent evaluating these filters
+- `page_index_eval_time`

Review Comment:
   ```suggestion
   - `predicate_evaluation_errors`: number of times evaluating the filter expression failed (expected to be zero in normal operation)
   - `num_predicate_creation_errors`: number of errors creating predicates (expected to be zero in normal operation)
   - `pushdown_eval_time`: time spent evaluating these filters
   - `page_index_eval_time`: time required to evaluate the page index filters
   ```



##########
datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs:
##########
@@ -178,6 +178,8 @@ impl PagePruningAccessPlanFilter {
 
         // track the total number of rows that should be skipped
         let mut total_skip = 0;
+        // track the total number of rows that should not be skipped
+        let mut total_pass = 0;

Review Comment:
   Minor nit: `total_select` might be more consistent terminology here.



##########
docs/source/user-guide/explain-usage.md:
##########
@@ -223,6 +223,21 @@ Again, reading from bottom up:
 - `SortPreservingMergeExec`
   - `output_rows=5`, `elapsed_compute=2.375µs`: Produced the final 5 rows in 2.375µs (microseconds)
 
+When predicate pushdown is enabled, `ParquetExec` gains the following metrics:

Review Comment:
   ❤️ 



