TheR1sing3un opened a new pull request, #7796:
URL: https://github.com/apache/paimon/pull/7796
## Purpose
`ReadBuilder.with_projection(List[str])` only accepts top-level column names
today; dotted forms like `'struct.subfield'` are silently dropped, and the
low-level "project by integer paths" API doesn't exist. The reader cannot push
nested column pruning down to the PyArrow scanner (Parquet/ORC) or to fastavro,
so nested projection over a struct column means materialising the whole struct
and discarding the unwanted children client-side.
This PR ports the nested-projection feature for **append-only** read paths.
The change is split into three commits, laid out in phases for review:
1. `[python] Add Projection utility and nested-projection API on ReadBuilder`
* New `pypaimon/utils/projection.py` with `Projection` ABC +
`TopLevelProjection` / `NestedProjection` / `_EmptyProjection`. Factories:
`Projection.of(...)`, `empty()`, `range(start, end)`.
* `ReadBuilder.with_projection(['a.b.c'])` now resolves dotted names into
integer paths.
* `ReadBuilder.with_nested_projection(int[][])` low-level entry point.
* Nested leaves flatten to underscore-joined names (`a_b` for `a.b`,
`_$N` suffix on collisions); leaf field IDs are preserved so schema-evolution
remap still works.
2. `[python] Push down nested-field projection to PyArrow scanner for
append-only`
* `FormatPyArrowReader` switches to a dict-form
`dataset.scanner(columns={...})` with `ds.field(*path)` when at least one path
has length > 1. Sub-field schema evolution surfaces as a NULL column via
`_path_exists_in_arrow_schema`.
* `SplitRead` threads `nested_name_paths` through, with three small
helpers (`_nested_path_by_name`, nested-aware `_get_fields_and_predicate`
reachability, `_get_final_read_data_fields` short-circuit).
`create_index_mapping` returns identity in nested mode because the reader emits
batches whose columns already match `read_fields`.
* `_construct_partition_mapping` has a parallel nested-mode bypass so
partitioned tables don't drop non-projected top-level columns.
* Variant shredded reassembly is preserved per-column when nested mode is
active (only sub-field walks skip it).
* Avro / Lance / Vortex / Blob raise `NotImplementedError` when a nested
path reaches them.
* Primary-key tables that need a merge read raise `NotImplementedError`;
raw-convertible PK splits work via the append-only path. Data-evolution tables
also raise (multi-file union by leaf id would silently produce NULLs).
3. `[python] Implement Python-side nested-projection fallback for Avro`
* fastavro has no native nested column pruning; the reader walks each
record dict step-by-step using `_walk_avro_record`. Top-level-only projection
keeps the existing `record.get(name)` fast path.
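The Python-side Avro fallback in commit 3 can be pictured with the sketch below. It is a minimal illustration under stated assumptions, not the PR's code: `walk_record` and `project_record` are hypothetical names (the PR's own helper is `_walk_avro_record`, whose exact signature is not shown here), and records are assumed to be plain nested dicts as fastavro yields them.

```python
from typing import Any, Dict, Mapping, Sequence


def walk_record(record: Any, path: Sequence[str]) -> Any:
    """Follow `path` one name at a time; return None if any step is missing.

    Hypothetical helper for illustration only.
    """
    current = record
    for name in path:
        # A missing or null parent struct yields a null leaf.
        if not isinstance(current, Mapping):
            return None
        current = current.get(name)
    return current


def project_record(record: Mapping[str, Any],
                   paths: Mapping[str, Sequence[str]]) -> Dict[str, Any]:
    """Build one output row; keys of `paths` are the output column names."""
    out: Dict[str, Any] = {}
    for out_name, path in paths.items():
        if len(path) == 1:
            # Top-level column: direct lookup, no walk needed.
            out[out_name] = record.get(path[0])
        else:
            out[out_name] = walk_record(record, path)
    return out
```

For a record such as `{"id": 1, "mv": {"latest_version": 7, "latest_value": "x"}}`, projecting `("id",)` and `("mv", "latest_version")` keeps only those two values; the parent struct is still decoded by the Avro reader, so this saves downstream work rather than I/O.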
## Linked issue / design
Surfaced while landing the row-tracking + ML-feature read paths internally:
those tables store `mv ROW<latest_version, latest_value>` style sub-fields and
the cost of materialising the parent struct just to read one leaf was the
dominant scan cost.
Java reference points (for alignment context, not a literal port):
- `paimon-common/.../utils/Projection.java` — abstract base + three concrete
subclasses
- `paimon-api/.../types/RowType.java:project(int[]/int[][]/List<String>)` —
semantics of "project a row type"
- `paimon-format/.../parquet/ParquetReaderFactory.java:clipParquetType` —
nested column pruning
- `paimon-flink/.../flink/Projection.java:getOuterProjectRow` +
`NestedProjectedRowData` — outer extraction (see "Out of scope" below)
## Tests
* `pypaimon/tests/test_projection_utility.py` — 21 unit cases covering
top-level / nested / empty projections, factory dispatch, mixed-input
rejection, leaf field-ID preservation, `_$N` collision dedup (per-call
monotonic).
* `pypaimon/tests/test_read_builder_nested_projection.py` — 10 cases for
`with_projection` / `with_nested_projection` state transitions, dotted-name
resolution, unknown-column silent skip, no row-tracking injection without
explicit projection.
* `pypaimon/tests/test_nested_projection_e2e.py` — 8 e2e cases: dotted
projection on Parquet, mixed nested/top-level reorder, low-level integer-path
API, top-level fast path unchanged, partitioned table nested projection
(regression for non-projected top-level columns being dropped), Avro
Python-side fallback, Avro top-level unchanged, PK + merge raises clear
`NotImplementedError`.
Local: `pytest pypaimon/tests/test_projection_utility.py
pypaimon/tests/test_read_builder_nested_projection.py
pypaimon/tests/test_nested_projection_e2e.py` → 39 passed; regression on
`reader_append_only_test.py` / `reader_primary_key_test.py` /
`file_store_commit_test.py` / `streaming_table_scan_test.py` /
`partition_predicate_test.py` / `projection_predicate_index_test.py` clean
(pre-existing lance/vortex env failures unrelated). `flake8
--config=dev/cfg.ini` clean.
## Out of scope (separate PR)
* **PK tables that go through `MergeFileSplitRead`**: the merge function
needs the full parent struct to operate, so the implementation needs an
outer-extraction layer (`OuterProjectionRecordReader` walking paths after
merge). Tracked as a follow-up; the PR raises `NotImplementedError` clearly
until then.
* **`AggregateMergeFunction` + projection regression tests**: depends on the
aggregation merge engine port not yet on master.
* **`ARRAY<ROW>` / `MAP` nested projection**: `NestedProjection` only walks
`RowType` children. Aligns with the Java side.
* **Sub-field schema evolution by ID**: nested paths walk by name; renaming
a parent struct or a leaf surfaces as a NULL column. Top-level projection still
uses field-ID remap. Documented; addressing it cleanly needs reading
Parquet/ORC field-id metadata from the file footer.
* **Avro native nested column pruning**: fastavro does not expose it; the
Python-side walk is the workaround.
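The name-based limitation listed above (renamed sub-fields surfacing as NULL) can be illustrated with a small sketch. This is an assumption-laden model, not the PR's `_path_exists_in_arrow_schema`: the file schema is represented as a hypothetical nested dict rather than an Arrow schema, and `path_exists` is an illustrative name.

```python
from typing import Mapping, Sequence


def path_exists(file_schema: Mapping[str, object], path: Sequence[str]) -> bool:
    """Return True only if every name in `path` resolves inside `file_schema`.

    Hypothetical stand-in for a by-name existence check on a file's schema.
    """
    current: object = file_schema
    for name in path:
        if not isinstance(current, Mapping) or name not in current:
            return False
        current = current[name]
    return True


# Schema of a file written before a rename (illustrative field names).
old_file_schema = {
    "id": "INT",
    "mv": {"latest_version": "BIGINT", "latest_value": "STRING"},
}

# The table schema later renames the leaf to `value`; a by-name walk no
# longer finds it in the old file, so a reader would have to emit NULLs.
still_found = path_exists(old_file_schema, ("mv", "latest_value"))
renamed_found = path_exists(old_file_schema, ("mv", "value"))
```

Because the walk compares names rather than field IDs, the old file cannot be matched to the renamed leaf, which is why the item above points to footer field-id metadata as the clean fix.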
## API and format
Public Python API additions:
* `pypaimon.utils.projection.Projection` (and concrete subclasses).
* `ReadBuilder.with_projection(['a.b'])` — backward compatible:
top-level-only callers see the same observable behaviour as before; the dotted
form is opt-in.
* `ReadBuilder.with_nested_projection(int[][])` — new low-level entry point.
No file format change.
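To make the relationship between the two entry points concrete, the sketch below resolves dotted names into integer paths and derives flattened output names. It is illustrative only: the schema is modelled as a hypothetical ordered nested dict, `resolve_dotted` and `flattened_names` are not pypaimon functions, and the collision counter starting at 0 is an assumption (the description only states a per-call monotonic `_$N` suffix).

```python
from typing import Dict, List, Optional

# Hypothetical schema model: dict order is field order; a nested dict is a ROW.
Schema = Dict[str, object]


def resolve_dotted(schema: Schema, name: str) -> Optional[List[int]]:
    """Turn 'a.b' into an index path such as [1, 0]; None if it is unknown."""
    path: List[int] = []
    current: object = schema
    for part in name.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        path.append(list(current).index(part))
        current = current[part]
    return path


def flattened_names(schema: Schema, paths: List[List[int]]) -> List[str]:
    """Join the names along each path with '_' and dedup with a '_$N' suffix."""
    names: List[str] = []
    seen = set()
    counter = 0  # assumption: counter starts at 0 and only grows per call
    for path in paths:
        parts: List[str] = []
        current: object = schema
        for idx in path:
            key = list(current)[idx]
            parts.append(key)
            current = current[key]
        name = "_".join(parts)
        while name in seen:
            name = f"{name}_${counter}"
            counter += 1
        seen.add(name)
        names.append(name)
    return names


schema: Schema = {"id": "INT", "a": {"b": "STRING", "c": "INT"}, "a_b": "STRING"}
# Unknown names are skipped here, mirroring the silent-skip behaviour the tests list.
resolved = [resolve_dotted(schema, n) for n in ["a.b", "a_b", "nope"]]
paths = [p for p in resolved if p is not None]
names = flattened_names(schema, paths)
```

In this model `'a.b'` and the top-level column `a_b` collide after flattening, so the second occurrence receives the suffix; the integer paths in `paths` are the kind of value a caller would hand to `with_nested_projection`.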
## Documentation
Public docstrings on `Projection`, `with_projection`,
`with_nested_projection`, and `OuterProjectionRecordReader` (in commit 3 —
pending follow-up PR for the PK path) describe the new contract.
## Generative AI disclosure
Drafted with assistance from an AI coding tool; every behavioural guarantee
made by the new APIs is exercised by a test in one of the three new test files.