Re: [PR] Support ARRAY type with ORC-style flattened columnar storage [paimon-mosaic]

via GitHub Fri, 05 Jun 2026 17:33:10 -0700


JingsongLi commented on PR #52:
URL: https://github.com/apache/paimon-mosaic/pull/52#issuecomment-4636613792


   I found several correctness issues in the ARRAY implementation that should 
be fixed before merge:
   
   1. ARRAY child columns are corrupted when the child column uses dictionary 
encoding.
   
   `BucketWriter::write_dict_bit_packed` still iterates over `self.num_rows`, 
but for ARRAY child columns the physical row count is the number of elements, 
not the number of parent rows. A repro is 10 parent rows where each row 
contains 20 repeated `int32` elements such as `j % 3`: the child column chooses 
DICT, only the first 10 dictionary indexes are written, and the remaining 
decoded elements come back as dictionary index 0. Please iterate over 
`self.col_num_rows(col)` and use the child column's own null bitmap row count.
   
   2. Multiple top-level ARRAY columns in the same bucket can be reassembled 
with the wrong lengths column.
   
   `reassemble_list_columns` treats any child with `phys_idx > num_primary` as 
nested and uses `phys_idx - 1` as its lengths column. That is also true for the 
first child of the second top-level ARRAY column in a bucket, so `arr_b` can 
accidentally use `arr_a`'s child values as lengths. With `arr_a: list<int32>` 
and `arr_b: list<int64>` in one bucket, reading can panic while constructing a 
`ListArray` because the offsets exceed the values length. The fix should track 
the lengths physical index explicitly, or otherwise distinguish first-level 
children from nested children per parent.
   
   3. ARRAYs with null parent rows lose data for several supported element 
types.
   
   `validate_data_type` allows `List<Date32>`, `List<Time32>`, 
`List<Decimal128>`, `List<Timestamp>`, etc., and the normal write path supports 
those element arrays. But when a parent list has null rows, 
`flatten_list_values` calls `take_array`, whose fallback for unhandled types is 
`array.slice(0, 0)`. For example, `List<Date32>` with a null parent row that 
owns child slots reassembles with offsets pointing past an empty values array. 
Please either use Arrow's generic take implementation or handle every element 
type accepted by `validate_data_type`.
   
   4. The on-disk format changed while the footer version stayed at 1.
   
   This PR adds new bucket payload headers and a new schema type byte for 
ARRAY, but `VERSION` is still 1. Older v1 readers will try to parse new v1 
files instead of rejecting them cleanly. Please bump the format version, or 
keep the v1 payload layout backward-compatible and gate the new layout behind a 
new version.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Support ARRAY type with ORC-style flattened columnar storage [paimon-mosaic]

Reply via email to