josephperez3 opened a new pull request, #15747: URL: https://github.com/apache/iceberg/pull/15747
Closes #9497

## Problem

PR #3774 added `visitTwoLevelList()` to correctly detect 2-level lists during schema traversal, but the `list()` methods in the reader builders were not updated. `visitTwoLevelList()` does not push the repeated field name onto the `fieldNames` stack, so `currentPath()` is one segment short. This produces `defLevel=0` and `repLevel=-1` instead of the correct values, which makes `RepeatedReader` keep looping until it crashes with `ParquetDecodingException: Reading past RLE/BitPacking stream`.

## Fix

In each `list()` method, detect 2-level lists via `isOldListElementType()` and compute the def/rep levels from the element path instead of `currentPath()`. We can skip the `OptionReader` wrapping, since elements are always non-null in the 2-level encoding. The fix is applied to both the Flink and Spark read paths.

## Testing

New tests in `TestSparkParquetReader` and `TestFlinkParquetReader` write a 2-level Parquet file that includes some empty lists and verify each row. All existing tests pass. Without the fix, the new tests crash:

```
ParquetDecodingException: Can't read value in column [names, names_tuple] repeated binary names_tuple (STRING) at value 7 out of 7 in current page. repetition level: 0, definition level: 2
Caused by: could not read bytes at offset 40
```
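For reviewers, here is a minimal sketch of the level arithmetic described under Problem. It does not use Iceberg or parquet-mr code: `Field`, `maxDefinitionLevel()`, `maxRepetitionLevel()` and `exampleSchema()` are hypothetical stand-ins, and the schema is an assumption inferred from the column path `[names, names_tuple]` in the error above. It also assumes the reader builders derive their levels as "max level of the path minus one", which is consistent with the reported `defLevel=0` and `repLevel=-1`.

```java
import java.util.Arrays;
import java.util.List;

final class TwoLevelListLevels {
  enum Repetition { REQUIRED, OPTIONAL, REPEATED }

  // Simplified stand-in for a Parquet schema node (hypothetical, not parquet-mr's Type).
  static final class Field {
    final String name;
    final Repetition repetition;
    final List<Field> children;

    Field(String name, Repetition repetition, Field... children) {
      this.name = name;
      this.repetition = repetition;
      this.children = Arrays.asList(children);
    }

    Field child(String childName) {
      for (Field f : children) {
        if (f.name.equals(childName)) {
          return f;
        }
      }
      throw new IllegalArgumentException("No field named " + childName + " under " + name);
    }
  }

  // Max definition level of a path: the number of non-required fields along it.
  static int maxDefinitionLevel(Field root, String... path) {
    int level = 0;
    Field current = root;
    for (String name : path) {
      current = current.child(name);
      if (current.repetition != Repetition.REQUIRED) {
        level++;
      }
    }
    return level;
  }

  // Max repetition level of a path: the number of repeated fields along it.
  static int maxRepetitionLevel(Field root, String... path) {
    int level = 0;
    Field current = root;
    for (String name : path) {
      current = current.child(name);
      if (current.repetition == Repetition.REPEATED) {
        level++;
      }
    }
    return level;
  }

  // Assumed 2-level list schema:
  //   message table { optional group names (LIST) { repeated binary names_tuple (STRING); } }
  static Field exampleSchema() {
    Field element = new Field("names_tuple", Repetition.REPEATED);
    Field list = new Field("names", Repetition.OPTIONAL, element);
    return new Field("table", Repetition.REQUIRED, list);
  }

  public static void main(String[] args) {
    Field schema = exampleSchema();
    String[] shortPath = {"names"};                  // path that is one segment short
    String[] elementPath = {"names", "names_tuple"}; // path including the repeated element

    // Assumption: levels are derived as (max level of the path) - 1.
    System.out.println("short path:   defLevel=" + (maxDefinitionLevel(schema, shortPath) - 1)
        + " repLevel=" + (maxRepetitionLevel(schema, shortPath) - 1));
    System.out.println("element path: defLevel=" + (maxDefinitionLevel(schema, elementPath) - 1)
        + " repLevel=" + (maxRepetitionLevel(schema, elementPath) - 1));
  }
}
```

In this model the short path yields `defLevel=0` and `repLevel=-1`, the values reported above, while the element path yields `defLevel=1` and `repLevel=0`. A negative repetition level can never match a value read from the column, which fits the runaway read in `RepeatedReader`.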
