TheR1sing3un opened a new pull request, #7796:
URL: https://github.com/apache/paimon/pull/7796

   ## Purpose
   
   `ReadBuilder.with_projection(List[str])` only accepts top-level column names 
today; dotted forms like `'struct.subfield'` are silently dropped, and the 
low-level "project by integer paths" API doesn't exist. The reader cannot push 
nested column pruning down to the PyArrow scanner (Parquet/ORC) or to fastavro, 
so nested projection over a struct column means materialising the whole struct 
and discarding the unwanted children client-side.
   
   This PR ports the nested-projection feature for **append-only** read paths. 
The three commits are laid out in phases for review:
   
   1. `[python] Add Projection utility and nested-projection API on ReadBuilder`
      * New `pypaimon/utils/projection.py` with `Projection` ABC + 
`TopLevelProjection` / `NestedProjection` / `_EmptyProjection`. Factories: 
`Projection.of(...)`, `empty()`, `range(start, end)`.
      * `ReadBuilder.with_projection(['a.b.c'])` now resolves dotted names into 
integer paths.
      * `ReadBuilder.with_nested_projection(int[][])` low-level entry point.
      * Nested leaves flatten to underscore-joined names (`a_b` for `a.b`, 
`_$N` suffix on collisions); leaf field IDs are preserved so schema-evolution 
remap still works.
   
   2. `[python] Push down nested-field projection to PyArrow scanner for 
append-only`
      * `FormatPyArrowReader` switches to a dict-form 
`dataset.scanner(columns={...})` with `ds.field(*path)` when at least one path 
has length > 1. Sub-field schema evolution surfaces as a NULL column via 
`_path_exists_in_arrow_schema`.
      * `SplitRead` threads `nested_name_paths` through, with three small 
helpers (`_nested_path_by_name`, nested-aware `_get_fields_and_predicate` 
reachability, `_get_final_read_data_fields` short-circuit). 
`create_index_mapping` returns identity in nested mode because the reader emits 
batches whose columns already match `read_fields`.
      * `_construct_partition_mapping` has a parallel nested-mode bypass so 
partitioned tables don't drop non-projected top-level columns.
      * Variant shredded reassembly is preserved per-column when nested mode is 
active (only sub-field walks skip it).
      * Avro / Lance / Vortex / Blob raise `NotImplementedError` when a nested 
path reaches them.
      * Primary-key tables that need a merge read raise `NotImplementedError`; 
raw-convertible PK splits work via the append-only path. Data-evolution tables 
also raise (multi-file union by leaf id would silently produce NULLs).
   
   3. `[python] Implement Python-side nested-projection fallback for Avro`
      * fastavro has no native nested column pruning; the reader walks each 
record dict step-by-step using `_walk_avro_record`. Top-level-only projection 
keeps the existing `record.get(name)` fast path.
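
For reviewers who want a concrete picture of the step-by-step walk in commit 3, here is a minimal sketch. It is not the PR's `_walk_avro_record`; the helper names `walk_record` and `project_record`, and the choice to return `None` when an intermediate value is missing or is not a dict, are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Sequence


def walk_record(record: Optional[Dict[str, Any]], path: Sequence[str]) -> Any:
    """Follow `path` through nested dicts (hypothetical helper).

    Returns None as soon as an intermediate value is not a dict,
    which covers both a NULL parent struct and a missing key.
    """
    current: Any = record
    for name in path:
        if not isinstance(current, dict):
            return None
        current = current.get(name)
    return current


def project_record(
    record: Dict[str, Any],
    name_paths: List[Sequence[str]],
    out_names: List[str],
) -> Dict[str, Any]:
    """Build one flattened output row from a decoded Avro record dict."""
    return {
        out: walk_record(record, path)
        for out, path in zip(out_names, name_paths)
    }


row = project_record(
    {"id": 1, "mv": {"latest_version": 10, "latest_value": "x"}},
    [["id"], ["mv", "latest_version"]],
    ["id", "mv_latest_version"],
)
```

Because fastavro still decodes the whole record, a walk like this only trims what is handed to the caller; it does not reduce decode cost.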
   
   ## Linked issue / design
   
   Surfaced while landing the row-tracking + ML-feature read paths internally: 
those tables store `mv ROW<latest_version, latest_value>` style sub-fields and 
the cost of materialising the parent struct just to read one leaf was the 
dominant scan cost.
   
   Java reference points (for alignment context, not a literal port):
   - `paimon-common/.../utils/Projection.java` — abstract base + three concrete 
subclasses
   - `paimon-api/.../types/RowType.java:project(int[]/int[][]/List<String>)` — 
semantics of "project a row type"
   - `paimon-format/.../parquet/ParquetReaderFactory.java:clipParquetType` — 
nested column pruning
   - `paimon-flink/.../flink/Projection.java:getOuterProjectRow` + 
`NestedProjectedRowData` — outer extraction (see "Out of scope" below)
   
   ## Tests
   
   * `pypaimon/tests/test_projection_utility.py` — 21 unit cases covering 
top-level / nested / empty projections, factory dispatch, mixed-input 
rejection, leaf field-ID preservation, `_$N` collision dedup (per-call 
monotonic).
   * `pypaimon/tests/test_read_builder_nested_projection.py` — 10 cases for 
`with_projection` / `with_nested_projection` state transitions, dotted-name 
resolution, unknown-column silent skip, no row-tracking injection without 
explicit projection.
   * `pypaimon/tests/test_nested_projection_e2e.py` — 8 e2e cases: dotted 
projection on Parquet, mixed nested/top-level reorder, low-level integer-path 
API, top-level fast path unchanged, partitioned table nested projection 
(regression for non-projected top-level columns being dropped), Avro 
Python-side fallback, Avro top-level unchanged, PK + merge raises clear 
`NotImplementedError`.
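
The `_$N` collision dedup exercised by the unit cases can be pictured with the sketch below. The function name `flatten_names`, the counter starting at 0, and the exact suffix layout are assumptions for illustration; the PR's `Projection` code is authoritative for the real naming.

```python
from typing import List, Sequence


def flatten_names(name_paths: Sequence[Sequence[str]]) -> List[str]:
    """Join each path with underscores and de-duplicate collisions.

    Hypothetical helper: a single counter is shared across the whole
    call and only ever increases, so suffixes are monotonic per call.
    """
    used = set()
    counter = 0
    flattened: List[str] = []
    for path in name_paths:
        base = "_".join(path)
        name = base
        while name in used:
            # Assumed suffix format: "<base>_$<N>"
            name = f"{base}_${counter}"
            counter += 1
        used.add(name)
        flattened.append(name)
    return flattened


names = flatten_names([["a", "b"], ["a_b"], ["c"], ["a", "b"]])
```

In this sketch the nested path `a.b` and a top-level column literally named `a_b` both flatten to `a_b`, so the later one receives a suffix.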
   
   Local: `pytest pypaimon/tests/test_projection_utility.py 
pypaimon/tests/test_read_builder_nested_projection.py 
pypaimon/tests/test_nested_projection_e2e.py` → 39 passed; regression on 
`reader_append_only_test.py` / `reader_primary_key_test.py` / 
`file_store_commit_test.py` / `streaming_table_scan_test.py` / 
`partition_predicate_test.py` / `projection_predicate_index_test.py` clean 
(pre-existing lance/vortex env failures unrelated). `flake8 
--config=dev/cfg.ini` clean.
   
   ## Out of scope (separate PR)
   
   * **PK tables that go through `MergeFileSplitRead`**: the merge function 
needs the full parent struct to operate, so the implementation needs an 
outer-extraction layer (`OuterProjectionRecordReader` walking paths after 
merge). Tracked as a follow-up; the PR raises `NotImplementedError` clearly 
until then.
   * **`AggregateMergeFunction` + projection regression tests**: depends on the 
aggregation merge engine port not yet on master.
   * **`ARRAY<ROW>` / `MAP` nested projection**: `NestedProjection` only walks 
`RowType` children. Aligns with the Java side.
   * **Sub-field schema evolution by ID**: nested paths walk by name; renaming 
a parent struct or a leaf surfaces as a NULL column. Top-level projection still 
uses field-ID remap. Documented; addressing it cleanly needs reading 
Parquet/ORC field-id metadata from the file footer.
   * **Avro native nested column pruning**: fastavro doesn't expose one; 
Python-side walk is the workaround.
   
   ## API and format
   
   Public Python API additions:
   * `pypaimon.utils.projection.Projection` (and concrete subclasses).
   * `ReadBuilder.with_projection(['a.b'])` — backward compatible: 
top-level-only callers see the same observable behaviour as before; the dotted 
form is opt-in.
   * `ReadBuilder.with_nested_projection(int[][])` — new low-level entry point.
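
To illustrate how the two entry points relate, the sketch below resolves dotted names into integer paths over a toy schema. The `Field` dataclass and the `resolve_path` / `resolve_projection` helpers are hypothetical stand-ins, not pypaimon types; skipping unknown columns silently mirrors the behaviour named in the test list and is otherwise an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Toy schema node (hypothetical): a name plus child fields for structs."""
    name: str
    children: List["Field"] = field(default_factory=list)


def resolve_path(fields: List[Field], dotted: str) -> Optional[List[int]]:
    """Turn 'a.b.c' into positional indices, or None if any step is unknown."""
    path: List[int] = []
    current = fields
    for part in dotted.split("."):
        idx = next((i for i, f in enumerate(current) if f.name == part), None)
        if idx is None:
            return None
        path.append(idx)
        current = current[idx].children
    return path


def resolve_projection(fields: List[Field], names: List[str]) -> List[List[int]]:
    """Resolve every name, silently skipping those that do not resolve."""
    resolved = (resolve_path(fields, n) for n in names)
    return [p for p in resolved if p is not None]


schema = [
    Field("id"),
    Field("mv", [Field("latest_version"), Field("latest_value")]),
]
paths = resolve_projection(schema, ["mv.latest_value", "id", "nope.x"])
```

Under these assumptions, the resulting list of integer paths has the shape that the low-level `with_nested_projection` entry point is described as taking.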
   
   No file format change.
   
   ## Documentation
   
   Public docstrings on `Projection`, `with_projection`, 
`with_nested_projection`, and `OuterProjectionRecordReader` (in commit 3 — 
pending follow-up PR for the PK path) describe the new contract.
   
   ## Generative AI disclosure
   
   Drafted with assistance from an AI coding tool; every behavioural guarantee 
made by the new APIs is exercised by a test in one of the three new test files.

