Dandandan opened a new pull request, #21372: URL: https://github.com/apache/datafusion/pull/21372
## Which issue does this PR close?

N/A - Performance optimization

## Rationale for this change

When statistics prove that every remaining row group fully satisfies the filter predicate (i.e., the row group is "fully matched"), the per-row filter evaluation is unnecessary overhead. DataFusion already identifies fully matched row groups via `identify_fully_matched_row_groups()`, but this information is currently only used for LIMIT-based pruning.

This is particularly relevant for ClickBench-style queries with filters like `WHERE col <> 0` or `WHERE col <> ''`, where min/max statistics often show the filter is trivially true for all row groups (e.g., `min > 0` means no values are zero).

## What changes are included in this PR?

After all row group pruning (statistics, bloom filters, limit), check if every remaining row group is fully matched by the predicate. If so, drop the per-row filter from the Parquet decoder builder entirely.

## Are these changes tested?

Existing parquet integration tests pass (198 tests). The optimization is transparent: it produces the same results, just avoids redundant filter evaluation.

## Are there any user-facing changes?

No. This is a performance optimization that skips unnecessary work. Query results are unchanged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
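
For illustration only, here is a minimal, dependency-free sketch of the "fully matched" check for a `col <> value` filter. It is not the code in this PR and does not use DataFusion's types: `RowGroupStats`, `fully_matches_not_eq`, and `can_skip_row_filter` are hypothetical names, and the sketch assumes exact min/max statistics on a single `i64` column. It also assumes a row group only counts as fully matched when it has no NULLs, since `NULL <> value` does not evaluate to true.

```rust
/// Hypothetical stand-in for the statistics of one row group (single i64 column).
#[derive(Debug, Clone, Copy)]
struct RowGroupStats {
    min: i64,
    max: i64,
    null_count: u64,
}

/// Returns true when the statistics prove that every row satisfies `col <> value`:
/// `value` lies outside [min, max] and the row group contains no NULLs.
fn fully_matches_not_eq(stats: &RowGroupStats, value: i64) -> bool {
    stats.null_count == 0 && (value < stats.min || value > stats.max)
}

/// The per-row filter is redundant only if every remaining row group is fully matched.
/// An empty slice returns true: with nothing left to scan, the filter has no effect.
fn can_skip_row_filter(remaining: &[RowGroupStats], value: i64) -> bool {
    remaining.iter().all(|rg| fully_matches_not_eq(rg, value))
}

fn main() {
    // Row groups that survived earlier pruning; every one has min > 0 and no NULLs.
    let all_positive = [
        RowGroupStats { min: 1, max: 90, null_count: 0 },
        RowGroupStats { min: 5, max: 40, null_count: 0 },
    ];
    // One row group whose range includes 0, so some rows might fail `col <> 0`.
    let mixed = [
        RowGroupStats { min: 1, max: 90, null_count: 0 },
        RowGroupStats { min: -3, max: 7, null_count: 0 },
    ];

    println!("skip filter (all positive): {}", can_skip_row_filter(&all_positive, 0));
    println!("skip filter (mixed): {}", can_skip_row_filter(&mixed, 0));
}
```

The decision is all-or-nothing in this sketch: a single row group that is not provably fully matched keeps the per-row filter in place for the whole scan, which mirrors the "every remaining row group" condition described above.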
