TheR1sing3un opened a new pull request, #7741:
URL: https://github.com/apache/paimon/pull/7741
## Purpose
Predicate leaves are built against the *original* table schema via
`PredicateBuilder`, so each leaf's `index` field encodes a position in that
schema's column order. `SplitRead.__init__` then takes the predicate as-is and
forwards it to the row-level `FilterRecordReader`.
When the caller chains `.with_projection(...)` onto the read builder,
`read_type` becomes narrower or reordered relative to the original schema. The
`OffsetRow` handed to `FilterRecordReader` uses `read_type` indices, so the
stale `index` carried on the predicate either
* raises `"Position N is out of bounds for row arity M"` when `N` is past
the end of the projected row, or
* silently selects the wrong column when `N` is still in range but no longer
points at the predicate field.
This shows up most easily on PK tables (the row-level filter path) when a
non-PK column is filtered while the projection narrows the read.
## Fix
Add `rewrite_predicate_indices` in `push_down_utils.py` and call it from
`SplitRead.__init__`, against `read_type` rather than `self.read_fields`. The
distinction matters:
* `MergeFileSplitRead` augments `read_fields` with `_KEY_*/_SEQ/_KIND`
prefixes for the underlying KV scan.
* `KeyValueUnwrapRecordReader` returns `kv.value` whose arity equals
`len(read_type)` and whose coordinate space is `read_type`.
`FilterRecordReader` evaluates against the unwrapped value, so `read_type`
is the correct coordinate space for both `MergeFileSplitRead` and
`RawFileSplitRead`. The membership check (whether to apply the filter at all)
also moves onto `read_type` for the same reason — a predicate field that is in
`read_fields` only via a `_KEY_*` augmentation is not visible to
`FilterRecordReader`.
Helper contract: every predicate leaf field must be present in
`read_fields`, otherwise the helper raises with a clear message; the caller in
`SplitRead` already filters out the off-projection case before invoking the
rewrite.
## Linked issue
N/A — surfaced when reading a PK table with `with_projection(['id',
'value']).with_filter(equal('value', ...))`. The pre-fix path raised
`IndexError` from `OffsetRow.get_field`.
## Tests
New `pypaimon/tests/projection_predicate_index_test.py`:
End-to-end (kernel `ReadBuilder`, no extra integrations):
* `test_pk_filter_on_non_pk_with_projection_keeping_filter_column` — was the
`IndexError` trigger.
* `test_pk_filter_on_non_pk_with_projection_reordering_columns` — projection
reorders columns; remap must follow.
* `test_pk_filter_no_projection_still_works` — sanity: no-projection path
matches pre-fix output.
* `test_pk_filter_column_outside_projection_drops_filter` — when the filter
column is not projected the filter cannot apply at row level; we drop it, we do
not misapply with a stale index.
* `test_append_only_filter_with_projection_unchanged` — append-only
file-level pushdown uses field names rather than indices; guard against any
regression introduced by the fix.
Direct unit tests for the helper:
* `test_remaps_leaf_index_to_position_in_read_fields` — single leaf,
narrowed/reordered projection.
* `test_remaps_inside_and_or` — nested AND of leaves and a nested OR; every
leaf rebound.
* `test_returns_none_for_none_input`.
* `test_raises_when_leaf_field_missing` — leaf field absent from
`read_fields` raises a clear `ValueError`.
Local: `pytest pypaimon/tests/projection_predicate_index_test.py` → 9
passed; `flake8 --config=dev/cfg.ini` clean. Surrounding test suites
(`predicates_test`, `reader_primary_key_test`, `reader_append_only_test`,
`data_evolution_test`) show no new failures vs `origin/master`; the only
failures in those files are the pre-existing lance / vortex environment issues.
## API and format
No public API change. No file format change. The behavioural change is
restricted to scans that combine `with_filter()` and `with_projection()` and
previously either crashed or returned wrong rows.
## Documentation
The new helper carries a docstring explaining when and why the index remap
is needed; the call site in `SplitRead.__init__` keeps a long-form comment
documenting the `read_type` vs `read_fields` distinction so future maintainers
don't accidentally regress.
## Generative AI disclosure
Drafted with assistance from an AI coding tool; root cause and fix verified
by the author against the row-level `FilterRecordReader` path on PK tables, and
exercised by the regression tests above.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]