JingsongLi commented on PR #52: URL: https://github.com/apache/paimon-mosaic/pull/52#issuecomment-4636613792
I found several correctness issues in the ARRAY implementation that should be fixed before merge: 1. ARRAY child columns are corrupted when the child column uses dictionary encoding. `BucketWriter::write_dict_bit_packed` still iterates over `self.num_rows`, but for ARRAY child columns the physical row count is the number of elements, not the number of parent rows. A repro is 10 parent rows where each row contains 20 repeated `int32` elements such as `j % 3`: the child column chooses DICT, only the first 10 dictionary indexes are written, and the remaining decoded elements come back as dictionary index 0. Please iterate over `self.col_num_rows(col)` and use the child column's own null bitmap row count. 2. Multiple top-level ARRAY columns in the same bucket can be reassembled with the wrong lengths column. `reassemble_list_columns` treats any child with `phys_idx > num_primary` as nested and uses `phys_idx - 1` as its lengths column. That is also true for the first child of the second top-level ARRAY column in a bucket, so `arr_b` can accidentally use `arr_a`'s child values as lengths. With `arr_a: list<int32>` and `arr_b: list<int64>` in one bucket, reading can panic while constructing a `ListArray` because the offsets exceed the values length. The fix should track the lengths physical index explicitly, or otherwise distinguish first-level children from nested children per parent. 3. ARRAYs with null parent rows lose data for several supported element types. `validate_data_type` allows `List<Date32>`, `List<Time32>`, `List<Decimal128>`, `List<Timestamp>`, etc., and the normal write path supports those element arrays. But when a parent list has null rows, `flatten_list_values` calls `take_array`, whose fallback for unhandled types is `array.slice(0, 0)`. For example, `List<Date32>` with a null parent row that owns child slots reassembles with offsets pointing past an empty values array. Please either use Arrow's generic take implementation or handle every element type accepted by `validate_data_type`. 4. The on-disk format changed while the footer version stayed at 1. This PR adds new bucket payload headers and a new schema type byte for ARRAY, but `VERSION` is still 1. Older v1 readers will try to parse new v1 files instead of rejecting them cleanly. Please bump the format version, or keep the v1 payload layout backward-compatible and gate the new layout behind a new version. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
