TheR1sing3un commented on code in PR #8187:
URL: https://github.com/apache/paimon/pull/8187#discussion_r3386423297
##########
paimon-python/pypaimon/read/reader/data_file_batch_reader.py:
##########
@@ -57,55 +59,99 @@ def __init__(self, format_reader: RecordBatchReader, index_mapping: List[int], p
self.file_io = file_io
# Per-file field-id normalization: map the physically-read columns
# (the file's own field order/names) onto the latest read target by
- # field id, padding missing ids with NULL. ``None`` when there is no
- # evolution to reconcile (identity) -- the common path stays zero-copy.
- self._normalize_positions, self._normalize_names = \
- self._build_normalize_plan(file_data_fields, target_data_fields)
+ # field id, padding missing ids with NULL and recursing into nested
+ # ROW / ARRAY<ROW> / MAP<.,ROW> sub-fields the same way. ``None`` when
+ # there is no evolution to reconcile -- the common path stays zero-copy.
+ self._normalize_plan = self._build_normalize_plan(file_data_fields, target_data_fields)
Review Comment:
> The new nested field-id normalization is skipped for dotted nested
> projections. SplitRead.file_reader_supplier passes file_data_fields=None and
> target_data_fields=None whenever has_nested is true, so
> with_projection(['mv.renamed_leaf']) still reads by the new physical name from
> old files. I reproduced rename mv.s -> ss followed by projection ['id',
> 'mv.ss']: the old row returned mv_ss=None instead of 'a'. A nested type change
> is worse: projecting the evolved leaf kept old batches as int32 and new batches
> as int64, causing pyarrow.lib.ArrowInvalid during concatenation.
Fixed in b4d46ecc3.
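
For readers following the thread, here is a minimal sketch of why a by-name
lookup loses the renamed leaf while a by-field-id lookup keeps it. It is not
the code from the PR or the fix commit: the Field class, normalize_row, and
the field ids 1/2/3 are hypothetical, and plain dicts stand in for Arrow
batches. It also leaves out the type alignment (int32 vs int64) mentioned in
the comment above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Field:
    """Hypothetical schema field: a stable id, a name, optional ROW children."""
    id: int
    name: str
    children: List["Field"] = field(default_factory=list)


def normalize_row(row: Optional[Dict[str, Any]],
                  file_fields: List[Field],
                  target_fields: List[Field]) -> Optional[Dict[str, Any]]:
    """Re-key a row read with the file's schema onto the target schema by id."""
    if row is None:
        return None
    file_by_id = {f.id: f for f in file_fields}
    out: Dict[str, Any] = {}
    for target in target_fields:
        source = file_by_id.get(target.id)
        if source is None:
            # The field did not exist when the file was written: pad with NULL.
            out[target.name] = None
        elif target.children:
            # Nested ROW: recurse so renamed sub-fields are matched by id too.
            out[target.name] = normalize_row(
                row.get(source.name), source.children, target.children)
        else:
            out[target.name] = row.get(source.name)
    return out


# Old file schema has mv ROW<s>; the latest schema renames s -> ss (same id 3).
old_schema = [Field(1, "id"), Field(2, "mv", [Field(3, "s")])]
new_schema = [Field(1, "id"), Field(2, "mv", [Field(3, "ss")])]
old_row = {"id": 1, "mv": {"s": "a"}}

# Looking up the new name directly in the old row finds nothing.
by_name = old_row["mv"].get("ss")

# Matching by field id carries the old value over under the new name.
by_id = normalize_row(old_row, old_schema, new_schema)
```

In this sketch by_name ends up as None, which mirrors the mv_ss=None result
reported above, while by_id keeps the value 'a' under the new name ss.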