adriangb opened a new issue, #22495:
URL: https://github.com/apache/datafusion/issues/22495
### Is your feature request related to a problem or challenge?
Follow-up from #22460.
When opening a Parquet file, the opener decides whether to build a
`FilePruner` (which can prune the file before/while reading it). After #22460
that gate is:
```rust
predicate
.as_ref()
.filter(|p| contains_dynamic_filter(p) ||
partitioned_file.has_statistics())
.and_then(|p| FilePruner::try_new(...))
```
`contains_dynamic_filter` (formerly `is_dynamic_physical_expr`) is a **loose
proxy**. The pruner can prune a file via two independent mechanisms:
- **(A) Statistics pruning** — the predicate references data columns with
real min/max/null statistics and is incompatible with them (e.g. `a > 1000` vs
`max(a) = 3`). Requires `has_statistics()`.
- **(B) Fold-to-constant pruning** — the opener folds partition values (and
stats-constant columns) into literals via `replace_columns_with_literals`; if a
conjunct then collapses to constant `false`, the file is pruned with **no
column statistics needed**.
`contains_dynamic_filter` is standing in for mechanism (B), but imprecisely:
- **False positive**: a dynamic filter on *data* columns over a stats-less
file builds a pruner that can never prune (no stats, nothing folds to a
constant) — a wasted `build_pruning_predicate`.
- It does not actually check whether folding can produce a constant.
### Describe the solution you'd like
Gate mechanism (B) on the property that actually enables it: a top-level
`AND` conjunct whose columns are a subset of the partition (+ stats-constant)
columns — i.e. one that folds to a constant.
The check must be **per-conjunct**, not over the whole predicate:
| predicate | folds to | (B)-prunable | whole-predicate `cols ⊆ part`? |
|---|---|---|---|
| `part = 2` | `1 = 2` | ✅ | ✅ |
| `part = 2 AND a > 5` | `1=2 AND a>5` | ✅ (the `1=2` term) | ❌ false
negative |
| `part = 2 OR a > 5` | `a > 5` | ❌ | ❌ correct |
| `part = a` | `1 = a` | ❌ | ❌ correct |
Note `part = a` (a partition column compared to a **data** column) is the
key false positive a naive "references a partition column" check would wrongly
accept: after folding it is `1 = a`, which still needs `a`'s statistics
(mechanism A).
Since the opener already has the folded predicate, the check is simply:
```rust
split_conjunction(folded_predicate).any(|c| collect_columns(c).is_empty())
```
A conjunct with no remaining column references is, by construction, one that
referenced only partition/constant columns.
The minimal, false-positive-free creation gate becomes:
```rust
has_statistics() || (contains_dynamic_filter(p) &&
has_partition_only_conjunct(p))
```
- `has_statistics()` → mechanism (A).
- `contains_dynamic_filter(p) && has_partition_only_conjunct(p)` → mechanism
(B) that planning could not already do. Static partition pruning is a planning
concern; only *dynamic* partition filters are unknown at planning time, so the
`&& contains_dynamic_filter` keeps us from building no-op pruners for static
partition predicates that planning already handled.
### Residual subtlety
A constant conjunct can fold to `true` (e.g. static `part = 1` on a file
actually in `part=1`). Excluding it perfectly requires *evaluating* the
constant, which is essentially the prune itself. In practice these are removed
by planning-time partition pruning, so the leftover is a cheap no-op build.
### Test matrix to add
- `part = 2`, static vs dynamic, file in matching / non-matching partition
- `part = 2 AND a > 5` (must still build a pruner: the `part = 2` term
prunes)
- `part = 2 OR a > 5` (must *not* build via mechanism B)
- `part = a` — partition column vs data column — must *not* build a pruner
on a stats-less file
- stats-constant column folding (column with `min == max`)
### Describe alternatives you've considered
Keeping the current `contains_dynamic_filter(p) || has_statistics()` gate
(what #22460 ships). It is behavior-preserving relative to the previous
`is_dynamic_physical_expr(p) || has_statistics()` and correct, just imprecise
at the margin.
### Additional context
Deferred from #22460, which only removed the `FilePruner`/opener dependency
on `snapshot_generation` (behavior-preserving). This is a behavior change to
*which files get a pruner*. In practice real Parquet files carry statistics, so
`has_statistics()` short-circuits the gate and this refinement only affects
stats-less files (rare) — making it a correctness/clarity improvement rather
than a hot-path win, and worth its own focused PR with the test matrix above.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]