adriangb opened a new issue, #22495:
URL: https://github.com/apache/datafusion/issues/22495

   ### Is your feature request related to a problem or challenge?
   
   Follow-up from #22460.
   
   When opening a Parquet file, the opener decides whether to build a 
`FilePruner` (which can prune the file before/while reading it). After #22460 
that gate is:
   
   ```rust
   predicate
       .as_ref()
       .filter(|p| contains_dynamic_filter(p) || 
partitioned_file.has_statistics())
       .and_then(|p| FilePruner::try_new(...))
   ```
   
   `contains_dynamic_filter` (formerly `is_dynamic_physical_expr`) is a **loose 
proxy**. The pruner can prune a file via two independent mechanisms:
   
   - **(A) Statistics pruning** — the predicate references data columns with 
real min/max/null statistics and is incompatible with them (e.g. `a > 1000` vs 
`max(a) = 3`). Requires `has_statistics()`.
   - **(B) Fold-to-constant pruning** — the opener folds partition values (and 
stats-constant columns) into literals via `replace_columns_with_literals`; if a 
conjunct then collapses to constant `false`, the file is pruned with **no 
column statistics needed**.
   
   `contains_dynamic_filter` is standing in for mechanism (B), but imprecisely:
   
   - **False positive**: a dynamic filter on *data* columns over a stats-less 
file builds a pruner that can never prune (no stats, nothing folds to a 
constant) — a wasted `build_pruning_predicate`.
   - It does not actually check whether folding can produce a constant.
   
   ### Describe the solution you'd like
   
   Gate mechanism (B) on the property that actually enables it: a top-level 
`AND` conjunct whose columns are a subset of the partition (+ stats-constant) 
columns — i.e. one that folds to a constant.
   
   The check must be **per-conjunct**, not over the whole predicate:
   
   | predicate | folds to | (B)-prunable | whole-predicate `cols ⊆ part`? |
   |---|---|---|---|
   | `part = 2` | `1 = 2` | ✅ | ✅ |
   | `part = 2 AND a > 5` | `1=2 AND a>5` | ✅ (the `1=2` term) | ❌ false 
negative |
   | `part = 2 OR a > 5` | `a > 5` | ❌ | ❌ correct |
   | `part = a` | `1 = a` | ❌ | ❌ correct |
   
   Note `part = a` (a partition column compared to a **data** column) is the 
key false positive a naive "references a partition column" check would wrongly 
accept: after folding it is `1 = a`, which still needs `a`'s statistics 
(mechanism A).
   
   Since the opener already has the folded predicate, the check is simply:
   
   ```rust
   split_conjunction(folded_predicate).any(|c| collect_columns(c).is_empty())
   ```
   
   A conjunct with no remaining column references is, by construction, one that 
referenced only partition/constant columns.
   
   The minimal, false-positive-free creation gate becomes:
   
   ```rust
   has_statistics() || (contains_dynamic_filter(p) && 
has_partition_only_conjunct(p))
   ```
   
   - `has_statistics()` → mechanism (A).
   - `contains_dynamic_filter(p) && has_partition_only_conjunct(p)` → mechanism 
(B) that planning could not already do. Static partition pruning is a planning 
concern; only *dynamic* partition filters are unknown at planning time, so the 
`&& contains_dynamic_filter` keeps us from building no-op pruners for static 
partition predicates that planning already handled.
   
   ### Residual subtlety
   
   A constant conjunct can fold to `true` (e.g. static `part = 1` on a file 
actually in `part=1`). Excluding it perfectly requires *evaluating* the 
constant, which is essentially the prune itself. In practice these are removed 
by planning-time partition pruning, so the leftover is a cheap no-op build.
   
   ### Test matrix to add
   
   - `part = 2`, static vs dynamic, file in matching / non-matching partition
   - `part = 2 AND a > 5` (must still build a pruner: the `part = 2` term 
prunes)
   - `part = 2 OR a > 5` (must *not* build via mechanism B)
   - `part = a` — partition column vs data column — must *not* build a pruner 
on a stats-less file
   - stats-constant column folding (column with `min == max`)
   
   ### Describe alternatives you've considered
   
   Keeping the current `contains_dynamic_filter(p) || has_statistics()` gate 
(what #22460 ships). It is behavior-preserving relative to the previous 
`is_dynamic_physical_expr(p) || has_statistics()` and correct, just imprecise 
at the margin.
   
   ### Additional context
   
   Deferred from #22460, which only removed the `FilePruner`/opener dependency 
on `snapshot_generation` (behavior-preserving). This is a behavior change to 
*which files get a pruner*. In practice real Parquet files carry statistics, so 
`has_statistics()` short-circuits the gate and this refinement only affects 
stats-less files (rare) — making it a correctness/clarity improvement rather 
than a hot-path win, and worth its own focused PR with the test matrix above.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to