adriangb commented on code in PR #22031:
URL: https://github.com/apache/datafusion/pull/22031#discussion_r3193411270


##########
datafusion/datasource-parquet/src/page_filter.rs:
##########
@@ -230,19 +234,31 @@ impl PagePruningAccessPlanFilter {
                     file_metrics,
                 );
 
-                let Some((selection, total_pages, matched_pages)) = selection else {
+                let Some((selection, pages)) = selection else {
                     trace!("No pages pruned in prune_pages_in_one_row_group");
                     continue;
                 };
-                total_pages_select += matched_pages;
-                total_pages_skip += total_pages - matched_pages;
 
                 debug!(
                     "Use filter and page index to create RowSelection {:?} from predicate: {:?}",
                     &selection,
                     predicate.predicate_expr(),
                 );
 
+                total_pages_in_group = pages.len();

Review Comment:
   Codex suggested:
   
   If every predicate skips due to a missing column index on line 239, the 
`total_pages_in_group` remains at 0 and `matched_pages_in_group` remains 
`None`. This causes the row group to silently contribute 0 total, 0 matched, 
and 0 pruned to the metrics, even though N pages will actually be scanned. 
While this behavior is not a new regression, it is an inaccuracy worth 
addressing while modifying this part of the codebase.
   
   To fix this, `total_pages_in_group` should be derived from the offset index 
upfront as a property of the row group rather than within the predicate loop. 
By initializing `matched_pages_in_group` to 
`(0..total_pages_in_group).collect()`, an abstaining pruner will correctly 
report N total, N matched, and 0 pruned. This refactor also allows the 
`Option<HashSet<_>>` to be simplified into a plain `HashSet<_>` and removes the 
redundant assignment of `total_pages_in_group` inside the loop.
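   
   A minimal sketch of the suggested shape, not the actual DataFusion code: 
`PageCounts`, `count_pages`, and the `Option<HashSet<usize>>` per-predicate 
results are hypothetical stand-ins, and it assumes predicates combine by 
intersecting the page indices they keep. `total_pages_in_group` is passed in 
as if it had already been read from the offset index.
   
```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the per-row-group page metrics.
#[derive(Debug, PartialEq)]
struct PageCounts {
    total: usize,
    matched: usize,
    pruned: usize,
}

/// `total_pages_in_group` is assumed to come from the offset index, before
/// any predicate runs. Each entry in `predicate_results` is `None` when that
/// predicate abstains (for example, its column index is missing) or
/// `Some(kept)` with the page indices the predicate keeps.
fn count_pages(
    total_pages_in_group: usize,
    predicate_results: &[Option<HashSet<usize>>],
) -> PageCounts {
    // Start with every page matched, so abstaining predicates change nothing.
    let mut matched_pages_in_group: HashSet<usize> =
        (0..total_pages_in_group).collect();

    for result in predicate_results {
        let Some(kept) = result else {
            // Abstaining predicate: no pages pruned by it.
            continue;
        };
        // Assumption: predicates are ANDed, so a page survives only if
        // every non-abstaining predicate keeps it.
        matched_pages_in_group.retain(|page| kept.contains(page));
    }

    let matched = matched_pages_in_group.len();
    PageCounts {
        total: total_pages_in_group,
        matched,
        pruned: total_pages_in_group - matched,
    }
}

fn main() {
    // Every predicate abstains: all 4 pages are reported as matched.
    println!("{:?}", count_pages(4, &[None, None]));

    // One predicate abstains, one keeps pages 1 and 3 of 4.
    let kept: HashSet<usize> = [1, 3].into_iter().collect();
    println!("{:?}", count_pages(4, &[None, Some(kept)]));
}
```
   
   With this shape there is no `Option` to unwrap after the loop, and the 
total is never overwritten per predicate.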



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

