[PR] [python] Fix predicate index stale after with_projection narrows read_type [paimon]

via GitHub Wed, 29 Apr 2026 06:08:32 -0700


TheR1sing3un opened a new pull request, #7741:
URL: https://github.com/apache/paimon/pull/7741


   ## Purpose
   
   Predicate leaves are built against the *original* table schema via 
`PredicateBuilder`, so each leaf's `index` field encodes a position in that 
schema's column order. `SplitRead.__init__` then takes the predicate as-is and 
forwards it to the row-level `FilterRecordReader`.
   
   When the caller chains `.with_projection(...)` onto the read builder, 
`read_type` becomes narrower or reordered relative to the original schema. The 
`OffsetRow` handed to `FilterRecordReader` uses `read_type` indices, so the 
stale `index` carried on the predicate either
   
   * raises `"Position N is out of bounds for row arity M"` when `N` is past 
the end of the projected row, or
   * silently selects the wrong column when `N` is still in range but no longer 
points at the predicate field.
   
   This shows up most easily on PK tables (the row-level filter path) when a 
non-PK column is filtered while the projection narrows the read.
   
   ## Fix
   
   Add `rewrite_predicate_indices` in `push_down_utils.py` and call it from 
`SplitRead.__init__`, against `read_type` rather than `self.read_fields`. The 
distinction matters:
   
   * `MergeFileSplitRead` augments `read_fields` with `_KEY_*/_SEQ/_KIND` 
prefixes for the underlying KV scan.
   * `KeyValueUnwrapRecordReader` returns `kv.value` whose arity equals 
`len(read_type)` and whose coordinate space is `read_type`.
   
   `FilterRecordReader` evaluates against the unwrapped value, so `read_type` 
is the correct coordinate space for both `MergeFileSplitRead` and 
`RawFileSplitRead`. The membership check (whether to apply the filter at all) 
also moves onto `read_type` for the same reason — a predicate field that is in 
`read_fields` only via a `_KEY_*` augmentation is not visible to 
`FilterRecordReader`.
   
   Helper contract: every predicate leaf field must be present in 
`read_fields`, otherwise the helper raises with a clear message; the caller in 
`SplitRead` already filters out the off-projection case before invoking the 
rewrite.
   
   ## Linked issue
   
   N/A — surfaced when reading a PK table with `with_projection(['id', 
'value']).with_filter(equal('value', ...))`. The pre-fix path raised 
`IndexError` from `OffsetRow.get_field`.
   
   ## Tests
   
   New `pypaimon/tests/projection_predicate_index_test.py`:
   
   End-to-end (kernel `ReadBuilder`, no extra integrations):
   
   * `test_pk_filter_on_non_pk_with_projection_keeping_filter_column` — was the 
`IndexError` trigger.
   * `test_pk_filter_on_non_pk_with_projection_reordering_columns` — projection 
reorders columns; remap must follow.
   * `test_pk_filter_no_projection_still_works` — sanity: no-projection path 
matches pre-fix output.
   * `test_pk_filter_column_outside_projection_drops_filter` — when the filter 
column is not projected the filter cannot apply at row level; we drop it, we do 
not misapply with a stale index.
   * `test_append_only_filter_with_projection_unchanged` — append-only 
file-level pushdown uses field names rather than indices; guard against any 
regression introduced by the fix.
   
   Direct unit tests for the helper:
   
   * `test_remaps_leaf_index_to_position_in_read_fields` — single leaf, 
narrowed/reordered projection.
   * `test_remaps_inside_and_or` — nested AND of leaves and a nested OR; every 
leaf rebound.
   * `test_returns_none_for_none_input`.
   * `test_raises_when_leaf_field_missing` — leaf field absent from 
`read_fields` raises a clear `ValueError`.
   
   Local: `pytest pypaimon/tests/projection_predicate_index_test.py` → 9 
passed; `flake8 --config=dev/cfg.ini` clean. Surrounding test suites 
(`predicates_test`, `reader_primary_key_test`, `reader_append_only_test`, 
`data_evolution_test`) show no new failures vs `origin/master`; the only 
failures in those files are the pre-existing lance / vortex environment issues.
   
   ## API and format
   
   No public API change. No file format change. The behavioural change is 
restricted to scans that combine `with_filter()` and `with_projection()` and 
previously either crashed or returned wrong rows.
   
   ## Documentation
   
   The new helper carries a docstring explaining when and why the index remap 
is needed; the call site in `SplitRead.__init__` keeps a long-form comment 
documenting the `read_type` vs `read_fields` distinction so future maintainers 
don't accidentally regress.
   
   ## Generative AI disclosure
   
   Drafted with assistance from an AI coding tool; root cause and fix verified 
by the author against the row-level `FilterRecordReader` path on PK tables, and 
exercised by the regression tests above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [python] Fix predicate index stale after with_projection narrows read_type [paimon]

Reply via email to