[PR] [python] Add schema short-circuit to SplitRead and FileScanner read paths [paimon]

via GitHub Thu, 11 Jun 2026 23:45:27 -0700


MgjLLL opened a new pull request, #8217:
URL: https://github.com/apache/paimon/pull/8217


   ### Purpose
   
   Fix redundant filesystem I/O in `SplitRead` and `FileScanner` when reading 
the schema.
   
   `SplitRead` has 3 call sites that unconditionally call 
`schema_manager.get_schema(schema_id)` even when 
`schema_id == table.table_schema.id`, in which case the schema is already in 
memory. This causes unnecessary filesystem reads in the common case (no 
schema evolution).
   
   The Java equivalent (`RawFileSplitRead.createFileReader()`) short-circuits with:
   ```java
   schemaId == schema.id() ? schema : schemaManager.schema(schemaId)
   ```
   
   ### Changes
   
   - `split_read.py`: Add a `_resolve_schema()` method that returns the 
in-memory schema when the id matches, replacing the 3 direct `get_schema()` 
calls in `raw_reader_supplier`, `_get_fields_and_predicate`, and 
`_file_read_fields`
   - `file_scanner.py`: Add a `_schema_fields()` method with the same 
short-circuit pattern for `SimpleStatsEvolutions`
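
The short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `TableSchema`, `Table`, `CountingSchemaManager`, and `SplitReadSketch` are hypothetical stand-ins, and only the names `_resolve_schema`, `get_schema`, and `table.table_schema.id` are taken from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableSchema:
    """Hypothetical stand-in for the real table schema class."""
    id: int
    fields: List[str] = field(default_factory=list)


@dataclass
class Table:
    """Hypothetical table holding the schema that is already in memory."""
    table_schema: TableSchema


class CountingSchemaManager:
    """Hypothetical schema manager that counts simulated filesystem reads."""

    def __init__(self, schemas: Dict[int, TableSchema]):
        self._schemas = schemas
        self.reads = 0

    def get_schema(self, schema_id: int) -> TableSchema:
        self.reads += 1  # stands in for one filesystem read
        return self._schemas[schema_id]


class SplitReadSketch:
    """Illustrative reader; not the real SplitRead class."""

    def __init__(self, table: Table, schema_manager: CountingSchemaManager):
        self.table = table
        self.schema_manager = schema_manager

    def _resolve_schema(self, schema_id: int) -> TableSchema:
        # Short-circuit: reuse the in-memory schema when the ids match.
        if schema_id == self.table.table_schema.id:
            return self.table.table_schema
        # Otherwise fall back to the schema manager (a filesystem read).
        return self.schema_manager.get_schema(schema_id)


current = TableSchema(id=2, fields=["a", "b", "c"])
old = TableSchema(id=1, fields=["a", "b"])
manager = CountingSchemaManager({1: old, 2: current})
reader = SplitReadSketch(Table(current), manager)

same = reader._resolve_schema(2)     # no read: id matches the table schema
evolved = reader._resolve_schema(1)  # one read: older schema must be loaded
```

In this sketch only the second lookup reaches the schema manager, which mirrors the claim that the common case (no schema evolution) needs no filesystem read.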
   
   ### Tests
   
   - Added `file_scanner_schema_fields_test.py` with 3 test cases covering 
the short-circuit, delegation, and the zero-id edge case
   - All existing tests pass (106 passed)
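
The three cases listed above might look roughly like the sketch below. It is not the content of `file_scanner_schema_fields_test.py`: `schema_fields` is a hypothetical free-function version of the lookup, the mocks replace the real table and schema manager, and the zero-id case assumes the concern is that id `0` is falsy in Python and so must be compared with `==` rather than tested for truthiness.

```python
import unittest
from unittest.mock import MagicMock


def schema_fields(table_schema, schema_manager, schema_id):
    """Hypothetical stand-in for the short-circuit field lookup."""
    if schema_id == table_schema.id:
        return table_schema.fields
    return schema_manager.get_schema(schema_id).fields


class SchemaFieldsSketchTest(unittest.TestCase):
    def _schema(self, schema_id, fields):
        # Build a mock schema exposing only the attributes the lookup uses.
        schema = MagicMock()
        schema.id = schema_id
        schema.fields = fields
        return schema

    def test_short_circuit(self):
        current = self._schema(3, ["a", "b"])
        manager = MagicMock()
        self.assertEqual(schema_fields(current, manager, 3), ["a", "b"])
        manager.get_schema.assert_not_called()

    def test_delegation(self):
        current = self._schema(3, ["a", "b"])
        manager = MagicMock()
        manager.get_schema.return_value = self._schema(1, ["a"])
        self.assertEqual(schema_fields(current, manager, 1), ["a"])
        manager.get_schema.assert_called_once_with(1)

    def test_zero_id(self):
        # Id 0 is falsy; the equality check must still short-circuit.
        current = self._schema(0, ["a"])
        manager = MagicMock()
        self.assertEqual(schema_fields(current, manager, 0), ["a"])
        manager.get_schema.assert_not_called()
```

Saved to a file, a sketch like this can be run with `python -m unittest <module name>`.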


