[PR] [python] Add per-partition bucket pruning for HASH_FIXED tables [paimon]

via GitHub Sun, 10 May 2026 00:44:55 -0700


TheR1sing3un opened a new pull request, #7804:
URL: https://github.com/apache/paimon/pull/7804


   ## Purpose
   
   Closes the per-partition predicate pre-evaluation gap that PR #7744
   (which introduced ``BucketSelectConverter``) deliberately deferred. The
   gap is recorded as a TODO at the bottom of the module docstring:
   
   > Predicates of the form ``(part='a' AND bk IN (1,2)) OR (part='b' AND bk
   > IN (3,4))`` currently fall through to "no pruning" because the top-level
   > OR mixes partition and bucket-key constraints. Java simplifies the
   > predicate per concrete partition value first […], so each partition
   > gets a tighter bucket-key predicate and the corresponding bucket set.
   
   Java's ``BucketSelector.test(BinaryRow partition, Integer bucket, Integer
   numBucket)`` and ``PartitionValuePredicateVisitor`` already handle this;
   this PR brings the Python side in line.
   
   ## What changed
   
   **Commit 1** — ``[python] Add per-partition bucket pruning helper``:
   - ``replace_partition_predicate(predicate, partition_field_names,
     partition_values)``: new walker that substitutes partition leaves with
     their concrete truth value and folds AND/OR. Three-way return — ``None``
     (cleared), ``False`` (always false), or the simplified ``Predicate``.
   - ``_Selector`` now keys its cache by ``(partition_tuple, total_buckets)``.
     ``__call__`` accepts both ``(bucket, total_buckets)`` (legacy / early
     manifest filter) and ``(partition, bucket, total_buckets)`` (late
     filter on a fully decoded entry).
   - ``create_bucket_selector`` takes an optional ``partition_fields`` list.
     Without it (or with a predicate that doesn't touch any partition
     column), the selector keeps the existing partition-agnostic shape.
   - 9 new unit cases covering ``replace_partition_predicate`` folding, the
     per-partition cache, fall-through when partition is unknown, and the
     empty-bucket-set result for an unsatisfiable partition.
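
   The folding described for ``replace_partition_predicate`` can be sketched
   as follows. This is a minimal illustration, not the PR's code: the
   ``Leaf`` and ``Compound`` classes are hypothetical stand-ins for the real
   predicate types, only ``eq`` and ``in`` leaves are handled, and ``None``
   is read here as "no constraint left" (always true).

```python
from dataclasses import dataclass
from typing import Any, Sequence, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """Hypothetical leaf predicate: `field op literals`."""
    field: str
    op: str                      # "eq" or "in" in this sketch
    literals: Tuple[Any, ...]


@dataclass(frozen=True)
class Compound:
    """Hypothetical AND/OR node."""
    op: str                      # "and" or "or"
    children: Tuple["Pred", ...]


Pred = Union[Leaf, Compound]


def _leaf_is_true(leaf: Leaf, value: Any) -> bool:
    if leaf.op == "eq":
        return value == leaf.literals[0]
    if leaf.op == "in":
        return value in leaf.literals
    raise ValueError(f"unsupported op in sketch: {leaf.op}")


def replace_partition_predicate(
    predicate: Pred,
    partition_field_names: Sequence[str],
    partition_values: Sequence[Any],
) -> Union[None, bool, Pred]:
    """Return None (always true), False (always false), or a simplified predicate."""
    if isinstance(predicate, Leaf):
        if predicate.field not in partition_field_names:
            return predicate     # bucket-key or other leaf: keep as-is
        value = partition_values[list(partition_field_names).index(predicate.field)]
        return None if _leaf_is_true(predicate, value) else False

    results = [
        replace_partition_predicate(c, partition_field_names, partition_values)
        for c in predicate.children
    ]
    if predicate.op == "and":
        if any(r is False for r in results):
            return False         # one false child makes the AND false
        kept = [r for r in results if r is not None]
        if not kept:
            return None          # every child was true
    else:
        if any(r is None for r in results):
            return None          # one true child makes the OR true
        kept = [r for r in results if r is not False]
        if not kept:
            return False         # every child was false
    return kept[0] if len(kept) == 1 else Compound(predicate.op, tuple(kept))
```

   With the mixed-OR predicate from the TODO, partition ``a`` folds down to
   the bare ``bk IN (1,2)`` leaf, while a partition that matches neither
   branch folds to ``False``.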
   
   **Commit 2** — ``[python] Wire per-partition bucket pruning into
   FileScanner``:
   - ``_filter_manifest_entry`` now calls the selector with
     ``entry.partition``.
   - ``_create_bucket_selector`` passes the table's partition fields into
     ``create_bucket_selector``.
   - The early manifest filter still uses the two-arg form because the
     partition row hasn't been deserialised at that stage; the selector
     internally falls back to a sound partition-agnostic
     over-approximation there.
   - One e2e test on a two-partition × four-bucket table checks that
     ``(part='a' AND id=1) OR (part='b' AND id=2)`` is now pruned to
     at most 2 splits instead of one per ``(partition, bucket)`` combination.
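
   The two call shapes and the ``(partition_tuple, total_buckets)`` cache
   key can be illustrated with the sketch below. ``SelectorSketch`` and the
   ``compute`` callback are hypothetical names, and the bucket sets are a
   hard-coded table rather than real hashing. The partition-agnostic
   fallback is modelled as the union over all partitions, which is one
   possible sound over-approximation, not necessarily the one the PR uses.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple

# None means "keep every bucket"; a frozenset lists the buckets to keep.
BucketSet = Optional[FrozenSet[int]]


class SelectorSketch:
    """Hypothetical selector accepting both the two-arg and three-arg forms."""

    def __init__(self, compute: Callable[[Optional[tuple], int], BucketSet]):
        self._compute = compute
        self._cache: Dict[Tuple[Optional[tuple], int], BucketSet] = {}
        self.compute_calls = 0   # exposed only so the caching is observable

    def __call__(self, *args) -> bool:
        if len(args) == 2:       # early manifest filter: partition unknown
            partition = None
            bucket, total_buckets = args
        elif len(args) == 3:     # late filter: partition already decoded
            partition = tuple(args[0])
            bucket, total_buckets = args[1], args[2]
        else:
            raise TypeError("expected (bucket, total) or (partition, bucket, total)")

        key = (partition, total_buckets)
        if key not in self._cache:
            self.compute_calls += 1
            self._cache[key] = self._compute(partition, total_buckets)
        allowed = self._cache[key]
        return allowed is None or bucket in allowed


# Illustrative bucket sets only; real code would hash the bucket-key literals.
_PER_PARTITION = {("a",): frozenset({1}), ("b",): frozenset({2})}


def compute(partition: Optional[tuple], total_buckets: int) -> BucketSet:
    if partition is None:
        # Partition unknown: keep any bucket that some partition could need.
        return frozenset().union(*_PER_PARTITION.values())
    # Unsatisfiable partition: empty set, so every bucket is dropped.
    return _PER_PARTITION.get(partition, frozenset())


selector = SelectorSketch(compute)
```

   Calling ``selector(("a",), 1, 4)`` keeps the bucket, ``selector(("a",),
   2, 4)`` drops it, and the two-arg ``selector(2, 4)`` still keeps bucket 2
   because the partition is not yet known at that stage.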
   
   ## Soundness
   
   The bucket set the selector returns remains a *superset* of the buckets
   that contain matching rows, the same hard contract as PR #7744. False
   positives (keeping a bucket with no matches) are allowed; false negatives
   (dropping a bucket with matches) MUST never happen. Any error in partition
   substitution or hashing fails open to "all buckets accept", just like the
   existing fail-open path.
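
   The contract can be stated as a small sketch. The helper names
   (``fail_open``, ``accepts``, ``is_sound``) are hypothetical and only
   illustrate the rule; ``None`` again stands for "all buckets accept".

```python
from typing import Callable, FrozenSet, Optional, Set

BucketSet = Optional[FrozenSet[int]]   # None = accept every bucket


def fail_open(
    compute: Callable[[Optional[tuple], int], BucketSet],
) -> Callable[[Optional[tuple], int], BucketSet]:
    """Wrap a bucket-set computation so any error keeps all buckets."""
    def guarded(partition: Optional[tuple], total_buckets: int) -> BucketSet:
        try:
            return compute(partition, total_buckets)
        except Exception:
            return None                # over-keep rather than risk a false negative
    return guarded


def accepts(allowed: BucketSet, bucket: int) -> bool:
    return allowed is None or bucket in allowed


def is_sound(allowed: BucketSet, buckets_with_matches: Set[int]) -> bool:
    """Superset check: every bucket holding matching rows must be kept."""
    return allowed is None or buckets_with_matches <= allowed
```

   A property-style test could compare the selector's bucket set against a
   brute-force scan with ``is_sound``; extra buckets pass, a missing one
   fails.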
   
   ## Tests
   
   - 9 new unit cases for the helper + selector in
     ``PartitionAwareBucketSelectorUnitTest``.
   - 1 new integration case in ``BucketPruningIntegrationTest`` covering
     the mixed-OR end-to-end.
   - All 33 existing ``pushdown_bucket_test.py`` cases (BucketSelectConverter
     unit / Integration / Property) still pass — the partition-agnostic path
     is byte-for-byte unchanged.
   
   ## Out of scope
   
   - ARRAY / MAP / VARIANT partition columns: the existing
     ``_UNSAFE_BUCKET_KEY_TYPES`` gate already covers similar concerns on
     the bucket-key side; partition columns hit the same fail-open path
     via ``_evaluate_partition_leaf``'s exception handler.
   - ``MAX_VALUES`` cap continues to apply per-partition (matches Java).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
