TheR1sing3un opened a new pull request, #7804:
URL: https://github.com/apache/paimon/pull/7804
## Purpose
Closes the per-partition predicate pre-evaluation gap that PR #7744
(which introduced ``BucketSelectConverter``) deliberately deferred. The
gap was recorded as a TODO at the bottom of the module docstring:
> Predicates of the form ``(part='a' AND bk IN (1,2)) OR (part='b' AND bk
> IN (3,4))`` currently fall through to "no pruning" because the top-level
> OR mixes partition and bucket-key constraints. Java simplifies the
> predicate per concrete partition value first […], so each partition
> gets a tighter bucket-key predicate and the corresponding bucket set.
Java's ``BucketSelector.test(BinaryRow partition, Integer bucket, Integer
numBucket)`` and ``PartitionValuePredicateVisitor`` already handle this;
this PR brings the Python side in line.
## What changed
**Commit 1** — ``[python] Add per-partition bucket pruning helper``:
- ``replace_partition_predicate(predicate, partition_field_names,
partition_values)``: new walker that substitutes partition leaves with
their concrete truth value and folds AND/OR. Three-way return — ``None``
(cleared), ``False`` (always false), or the simplified ``Predicate``.
- ``_Selector`` now keys its cache by ``(partition_tuple, total_buckets)``.
``__call__`` accepts both ``(bucket, total_buckets)`` (legacy / early
manifest filter) and ``(partition, bucket, total_buckets)`` (late
filter on a fully decoded entry).
- ``create_bucket_selector`` takes an optional ``partition_fields`` list.
Without it (or with a predicate that doesn't touch any partition
column), the selector keeps the existing partition-agnostic shape.
- 9 new unit cases covering ``replace_partition_predicate`` folding, the
per-partition cache, fall-through when partition is unknown, and the
empty-bucket-set result for an unsatisfiable partition.
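
The substitute-and-fold step can be illustrated with a minimal sketch. This is not the PR's implementation: ``Leaf``, ``Compound`` and ``fold_partition`` are hypothetical stand-ins for the real predicate classes and for ``replace_partition_predicate``, only ``equal`` / ``in`` leaves are handled, and ``None`` is read here as "no constraint remains", which is an assumption about what "cleared" means.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Leaf:
    """Hypothetical leaf predicate: field <op> literals."""
    field: str
    op: str  # this sketch only understands "equal" and "in"
    literals: Tuple


@dataclass(frozen=True)
class Compound:
    """Hypothetical AND/OR node."""
    op: str  # "and" or "or"
    children: Tuple


def fold_partition(pred, partition_fields, partition_values):
    """Return None (no constraint left), False (unsatisfiable),
    or a simplified predicate that no longer mentions partition columns."""
    if isinstance(pred, Leaf):
        if pred.field not in partition_fields:
            return pred  # bucket-key (or other) leaf: keep untouched
        if pred.op not in ("equal", "in"):
            return None  # unknown operator: conservatively treat as true
        matches = partition_values[pred.field] in pred.literals
        return None if matches else False

    results = [fold_partition(c, partition_fields, partition_values)
               for c in pred.children]
    if pred.op == "and":
        if any(r is False for r in results):
            return False  # one false conjunct kills the AND
        kept = [r for r in results if r is not None]
        if not kept:
            return None  # every conjunct was trivially true
    else:  # "or"
        if any(r is None for r in results):
            return None  # one true disjunct satisfies the OR
        kept = [r for r in results if r is not False]
        if not kept:
            return False  # every disjunct was false
    return kept[0] if len(kept) == 1 else Compound(pred.op, tuple(kept))


# The mixed predicate from the TODO quoted in the Purpose section.
mixed = Compound("or", (
    Compound("and", (Leaf("part", "equal", ("a",)), Leaf("bk", "in", (1, 2)))),
    Compound("and", (Leaf("part", "equal", ("b",)), Leaf("bk", "in", (3, 4)))),
))
```

Folding ``mixed`` for partition ``a`` leaves only the ``bk IN (1, 2)`` leaf, which is the tighter bucket-key predicate the selector can then hash; a partition that matches neither branch folds to ``False``, i.e. an empty bucket set.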
**Commit 2** — ``[python] Wire per-partition bucket pruning into
FileScanner``:
- ``_filter_manifest_entry`` now calls the selector with
``entry.partition``.
- ``_create_bucket_selector`` passes the table's partition fields into
``create_bucket_selector``.
- The early manifest filter still uses the two-arg form because the
partition row hasn't been deserialised at that stage; the selector
internally falls back to a sound partition-agnostic
over-approximation there.
- One e2e test on a two-partition × four-bucket table verifies that
``(part='a' AND id=1) OR (part='b' AND id=2)`` is now pruned to
at most 2 splits instead of one per ``(partition, bucket)`` combination.
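
How the two call shapes and the per-partition cache fit together can be sketched as follows. ``SelectorSketch`` and ``demo_compute`` are hypothetical names, not the PR's ``_Selector``; the sketch assumes a partition can be turned into a hashable tuple and uses "accept every bucket" as the simplest sound fallback for the two-arg form.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple

PartitionKey = Optional[Tuple]
BucketSet = Optional[FrozenSet[int]]  # None means "accept every bucket"


class SelectorSketch:
    """Callable accepting (bucket, total) or (partition, bucket, total)."""

    def __init__(self, compute: Callable[[PartitionKey, int], BucketSet]):
        self._compute = compute
        # Cache keyed by (partition_tuple, total_buckets), as in the PR text.
        self._cache: Dict[Tuple[PartitionKey, int], BucketSet] = {}

    def __call__(self, *args) -> bool:
        if len(args) == 2:  # early manifest filter: partition not decoded yet
            partition = None
            bucket, total_buckets = args
        elif len(args) == 3:  # late filter on a fully decoded entry
            partition, bucket, total_buckets = args
        else:
            raise TypeError("expected (bucket, total) or (partition, bucket, total)")
        key = (None if partition is None else tuple(partition), total_buckets)
        if key not in self._cache:
            self._cache[key] = self._compute(*key)
        allowed = self._cache[key]
        return allowed is None or bucket in allowed


calls = []  # records cache misses, for illustration only


def demo_compute(partition, total_buckets):
    """Hypothetical stand-in for 'fold the predicate, then hash bucket keys'."""
    calls.append(partition)
    if partition is None:
        return None  # partition unknown: accept all buckets
    table = {("a",): frozenset({1}), ("b",): frozenset({2})}
    return table.get(partition, frozenset())  # unsatisfiable: empty set


selector = SelectorSketch(demo_compute)
```

With this shape, ``selector(("a",), 1, 4)`` keeps the bucket, ``selector(("a",), 0, 4)`` drops it, and the legacy ``selector(0, 4)`` keeps everything because no partition is available to narrow the set.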
## Soundness
The bucket set the selector returns remains a *superset* of the buckets
that contain matching rows, the same hard contract as PR #7744. False
positives (keeping a bucket with no matches) are allowed; false negatives
(dropping a bucket that has matches) MUST never happen. Any error in
partition substitution / hashing falls open to "all buckets accept", just
like the existing fail-open path.
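
The fail-open rule and the superset contract can be expressed in a few lines. This is an illustrative sketch only: ``fail_open``, ``is_sound`` and ``broken_selector`` are hypothetical helpers, not functions from the PR.

```python
def fail_open(select):
    """Wrap a selector so that any exception keeps the bucket."""
    def wrapped(*args):
        try:
            return bool(select(*args))
        except Exception:
            return True  # never drop a bucket because of an internal error
    return wrapped


def is_sound(selected, matching):
    """Soundness check: every bucket with matching rows must be selected."""
    return set(matching) <= set(selected)


def broken_selector(partition, bucket, total_buckets):
    # Simulates a failure during partition substitution or hashing.
    raise ValueError("cannot hash partition value")


safe = fail_open(broken_selector)
```

Over-keeping passes ``is_sound`` (for example, selecting ``{0, 1, 2}`` when only bucket ``1`` matches), while selecting ``{0}`` in the same situation fails it.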
## Tests
- 9 new unit cases for the helper + selector in
``PartitionAwareBucketSelectorUnitTest``.
- 1 new integration case in ``BucketPruningIntegrationTest`` covering
the mixed-OR end-to-end.
- All 33 existing ``pushdown_bucket_test.py`` cases (BucketSelectConverter
unit / Integration / Property) still pass — the partition-agnostic path
is byte-for-byte unchanged.
## Out of scope
- ARRAY / MAP / VARIANT partition columns: the existing
``_UNSAFE_BUCKET_KEY_TYPES`` gate already covers similar concerns on
the bucket-key side; partition columns hit the same fail-open path
via ``_evaluate_partition_leaf``'s exception handler.
- ``MAX_VALUES`` cap continues to apply per-partition (matches Java).