albertlockett opened a new pull request, #8573: URL: https://github.com/apache/arrow-rs/pull/8573
# Which issue does this PR close?

- Closes #8404

# Rationale for this change

A regression was reported in issue #8404, introduced in https://github.com/apache/arrow-rs/pull/7585. This PR resolves the issue.

# What changes are included in this PR?

The root cause was that `ByteArrayDictionaryReader` returns a new zero-length array of values if the record reader has already been consumed. In this early-return case, the repetition and definition level buffers were not being advanced.

https://github.com/apache/arrow-rs/blob/521f219e308613811aeae11300bf7a7b0fb5ec29/parquet/src/arrow/array_reader/byte_array_dictionary.rs#L167-L183

The `StructArrayReader` reads the repetition and definition levels from the first child to determine the nullability of the struct array. When we returned the empty values buffer for the child without advancing the repetition and definition level buffers, the `StructArrayReader` encountered a length mismatch between the empty child array and the non-empty nullability bitmask, which produced the error.

https://github.com/apache/arrow-rs/blob/521f219e308613811aeae11300bf7a7b0fb5ec29/parquet/src/arrow/array_reader/struct_array.rs#L137-L170

The fix is simple: always have `ByteArrayDictionaryReader` advance the repetition and definition level buffers when `consume_next_batch` is called.

# Are these changes tested?

Yes. A new unit test, `test_read_nullable_structs_with_binary_dict_as_first_child_column`, was added. Before the changes in this PR, it reproduced the issue.

# Are there any user-facing changes?

No
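To illustrate the root cause described under "What changes are included in this PR?", here is a minimal, self-contained sketch of the failure mode. It is a toy model, not the arrow-rs code: `ToyDictReader`, `build_struct`, and the `advance_levels_on_empty` flag are hypothetical names invented for this example, and only definition levels are modelled. The assumption is that the parent reader compares the child's value count against the length of the level-derived validity mask.

```rust
/// Hypothetical, simplified stand-in for a dictionary column reader.
/// This is NOT the arrow-rs `ByteArrayDictionaryReader`.
struct ToyDictReader {
    /// Values decoded but not yet handed out.
    pending_values: Vec<Option<&'static str>>,
    /// Definition levels decoded but not yet handed out (1 = present, 0 = null).
    pending_def_levels: Vec<i16>,
    /// Levels exposed to the parent for the most recently consumed batch.
    def_levels: Vec<i16>,
    /// When false, the early return skips the level update (the buggy behaviour).
    advance_levels_on_empty: bool,
}

impl ToyDictReader {
    fn new(advance_levels_on_empty: bool) -> Self {
        Self {
            pending_values: Vec::new(),
            pending_def_levels: Vec::new(),
            def_levels: Vec::new(),
            advance_levels_on_empty,
        }
    }

    /// Buffer some values together with their definition levels.
    fn read(&mut self, values: &[Option<&'static str>]) {
        for v in values {
            self.pending_def_levels.push(if v.is_some() { 1 } else { 0 });
            self.pending_values.push(*v);
        }
    }

    /// Hand out the buffered values and refresh the exposed levels.
    fn consume_batch(&mut self) -> Vec<Option<&'static str>> {
        if self.pending_values.is_empty() && !self.advance_levels_on_empty {
            // Buggy path: return an empty array but leave `def_levels` stale.
            return Vec::new();
        }
        // Fixed path: the exposed levels always track the batch being returned.
        self.def_levels = std::mem::take(&mut self.pending_def_levels);
        std::mem::take(&mut self.pending_values)
    }

    fn get_def_levels(&self) -> &[i16] {
        &self.def_levels
    }
}

/// Toy parent reader: derives the validity length from the first child's
/// levels and rejects a child whose length disagrees with it.
fn build_struct(child: &mut ToyDictReader) -> Result<usize, String> {
    let values = child.consume_batch();
    let validity_len = child.get_def_levels().len();
    if validity_len != values.len() {
        return Err(format!(
            "length mismatch: {} values vs {} validity entries",
            values.len(),
            validity_len
        ));
    }
    Ok(values.len())
}

fn main() {
    for fixed in [false, true] {
        let mut reader = ToyDictReader::new(fixed);
        reader.read(&[Some("a"), None, Some("b")]);
        let first = build_struct(&mut reader);
        // The reader is already consumed here, so this is the early-return case.
        let second = build_struct(&mut reader);
        println!("fixed={fixed}: first={first:?}, second={second:?}");
    }
}
```

In this model, the second `build_struct` call fails with a length mismatch when the levels are left stale, and succeeds with an empty batch once the levels are always advanced, which mirrors the shape of the fix described in this PR.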
