Dandandan opened a new pull request, #21372:
URL: https://github.com/apache/datafusion/pull/21372

   ## Which issue does this PR close?
   
   N/A - Performance optimization
   
   ## Rationale for this change
   
   When statistics prove that every remaining row group fully satisfies the 
filter predicate (i.e., the row group is "fully matched"), the per-row filter 
evaluation is unnecessary overhead. DataFusion already identifies fully matched 
row groups via `identify_fully_matched_row_groups()`, but this information is 
currently only used for LIMIT-based pruning.
   
   This is particularly relevant for ClickBench-style queries with filters like 
`WHERE col <> 0` or `WHERE col <> ''`, where min/max statistics often show the 
filter is trivially true for all row groups (e.g., `min > 0` means no values 
are zero).
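
The statistics argument above can be sketched as follows. This is a minimal illustration, not DataFusion's implementation: `ColumnStats` and `not_equal_fully_matched` are hypothetical names, and the statistics are assumed to be exact min/max values plus a null count for an integer column.

```rust
/// Hypothetical min/max statistics for one integer column in one row group.
/// (Illustrative only; this is not a DataFusion or parquet-rs type.)
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
    null_count: u64,
}

/// Returns true when the statistics alone prove that every row in the
/// row group satisfies `col <> value`.
fn not_equal_fully_matched(stats: &ColumnStats, value: i64) -> bool {
    // In SQL, `NULL <> value` evaluates to NULL and the row is filtered out,
    // so a row group containing NULLs cannot be proven fully matched here.
    let no_nulls = stats.null_count == 0;
    // If `value` lies outside [min, max], no stored value can equal it.
    let outside_range = value < stats.min || value > stats.max;
    no_nulls && outside_range
}
```

For example, a row group with `min = 1` and `max = 500` and no nulls is fully matched by `col <> 0`, while one with `min = -3` and `max = 7` is not, because zero may occur somewhere in that range.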
   
   ## What changes are included in this PR?
   
   After all row group pruning (statistics, bloom filters, limit), check if 
every remaining row group is fully matched by the predicate. If so, drop the 
per-row filter from the Parquet decoder builder entirely.
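
A rough sketch of that decision, under stated assumptions: `RowGroupState`, `can_drop_row_filter`, and `effective_filter` are hypothetical names invented for illustration, and the filter is modeled as a plain string rather than the actual row filter type used by the Parquet decoder builder.

```rust
/// Hypothetical outcome for one row group once statistics, bloom filter,
/// and limit pruning have all run. (Not a DataFusion type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RowGroupState {
    /// Skipped entirely; no rows are read.
    Pruned,
    /// May contain both matching and non-matching rows.
    PartiallyMatched,
    /// Statistics prove that every row satisfies the predicate.
    FullyMatched,
}

/// True when every row group that will still be read is fully matched.
/// With no remaining row groups this is vacuously true, which is harmless
/// because nothing is decoded in that case.
fn can_drop_row_filter(states: &[RowGroupState]) -> bool {
    states
        .iter()
        .filter(|s| **s != RowGroupState::Pruned)
        .all(|s| *s == RowGroupState::FullyMatched)
}

/// Returns the filter to hand to the decoder: `None` when the per-row
/// evaluation would be redundant, otherwise the original filter.
fn effective_filter<'a>(
    filter: Option<&'a str>,
    states: &[RowGroupState],
) -> Option<&'a str> {
    if can_drop_row_filter(states) {
        None
    } else {
        filter
    }
}
```

A single partially matched row group is enough to keep the filter, since the per-row check is still needed to reject its non-matching rows.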
   
   ## Are these changes tested?
   
   Existing Parquet integration tests pass (198 tests). The optimization is 
transparent: it produces the same results and only avoids redundant filter 
evaluation.
   
   ## Are there any user-facing changes?
   
   No. This is a performance optimization that skips unnecessary work. Query 
results are unchanged.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

