yordan-pavlov opened a new pull request #1130: URL: https://github.com/apache/arrow-rs/pull/1130
# Which issue does this PR close?

Closes #1111.

# Rationale for this change

As explained in #1111, `RleDecoder`, as used in `VariableLenDictionaryDecoder` as part of the implementation of `ArrowArrayReader`, incorrectly returns more keys than are actually available. At the same time, when the page contains NULLs, `VariableLenDictionaryDecoder` also requests more keys than are available, because `num_values` is inclusive of NULLs. Together, these cause a dictionary-encoded page that also contains NULLs to be decoded incorrectly, returning more values than necessary.

# What changes are included in this PR?

This PR contains:
* a fix where the actual number of values (excluding NULLs) is calculated from def levels (if present) and is used (instead of `num_values` from the data page) when creating the value decoder, so that the decoder knows how many values are actually available. This count is then used by existing code in `VariableLenDictionaryDecoder` to limit how many keys are requested from the nested `RleDecoder`.
* a new test `test_arrow_array_reader_dict_enc_string` for `ArrowArrayReader`
* a new test `test_complex_array_reader_dict_enc_string` for `ArrayReader`

# Are there any user-facing changes?

No

@alamb @tustvold
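
For illustration, the idea behind the fix (deriving the number of non-NULL values from the def levels and capping key requests at that count) might be sketched roughly as below. This is a minimal sketch, not the code in this PR: `count_non_null_values` and `keys_to_request` are hypothetical helper names, and it assumes a flat (non-nested) column, where a value is present exactly when its def level equals the maximum def level.

```rust
/// Hypothetical helper (not part of the arrow-rs API): returns how many
/// values are actually stored in a page. With def levels present, only
/// entries equal to `max_def_level` carry a value; the rest are NULLs.
/// Without def levels, every entry has a value, so `num_values` is used.
fn count_non_null_values(
    def_levels: Option<&[i16]>,
    max_def_level: i16,
    num_values: usize,
) -> usize {
    match def_levels {
        Some(levels) => levels
            .iter()
            .filter(|&&level| level == max_def_level)
            .count(),
        None => num_values,
    }
}

/// Hypothetical helper: caps the number of dictionary keys requested from
/// the key decoder at the number of values still remaining in the page.
fn keys_to_request(requested: usize, values_remaining: usize) -> usize {
    requested.min(values_remaining)
}
```

With def levels `[1, 0, 1, 1, 0]` and a maximum def level of 1, `count_non_null_values` yields 3 even though `num_values` for the page is 5, so a request for a larger batch of keys would be capped at 3 by `keys_to_request`.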
