yordan-pavlov commented on issue #1111: URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003224774
Here is what I've found so far:

* There is a test for plain-encoded strings that works with null values and across pages: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L1493
* `VariableLenPlainDecoder` is used in that test and works correctly. Although the decoder's `num_values` includes NULLs, it still stops at the end of the page, because it checks that it does not read past the values buffer (`while self.position < data_len`): https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L919
* What is missing is a test that exercises `VariableLenDictionaryDecoder`.
* `VariableLenDictionaryDecoder` relies on `RleDecoder` not reading past its buffer and returning 0 when no more values can be read: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L1069

It's getting pretty late now, but tomorrow I will try to write the missing test (one that doesn't rely on an external parquet file) to reproduce the issue with `VariableLenDictionaryDecoder` / `RleDecoder`, and also think about a short-term fix.
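
To make the difference between the two code paths concrete, here is a rough, self-contained sketch. It is not the arrow-rs code: `PlainDecoderSketch`, `IndexSourceSketch`, `encode_plain_sketch` and `read_dictionary_sketch` are hypothetical names made up for illustration, and the only layout assumed is the PLAIN `BYTE_ARRAY` one (a 4-byte little-endian length followed by the bytes). The point is that the plain path is protected by its own bounds check, while the dictionary path is only as safe as the "return 0 when exhausted" behaviour of whatever supplies the indices.

```rust
/// Hypothetical helper: PLAIN-encodes strings as <u32 LE length><bytes>.
fn encode_plain_sketch(values: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for v in values {
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v.as_bytes());
    }
    out
}

/// Hypothetical stand-in for a plain-encoded variable-length decoder.
struct PlainDecoderSketch<'a> {
    data: &'a [u8],
    position: usize,
    /// May over-count: NULL slots are included but have no bytes in `data`.
    num_values: usize,
}

impl<'a> PlainDecoderSketch<'a> {
    fn new(data: &'a [u8], num_values: usize) -> Self {
        Self { data, position: 0, num_values }
    }

    /// Reads up to `max` values; the buffer bound, not `num_values`, is what
    /// stops the loop when `num_values` over-counts.
    fn read(&mut self, max: usize) -> Vec<&'a [u8]> {
        let data: &'a [u8] = self.data;
        let to_read = max.min(self.num_values);
        let mut out = Vec::new();
        while self.position + 4 <= data.len() && out.len() < to_read {
            let p = self.position;
            let len = u32::from_le_bytes([data[p], data[p + 1], data[p + 2], data[p + 3]]) as usize;
            let (start, end) = (p + 4, p + 4 + len);
            if end > data.len() {
                break; // truncated value: do not read past the buffer
            }
            out.push(&data[start..end]);
            self.position = end;
        }
        self.num_values -= out.len();
        out
    }
}

/// Hypothetical index source standing in for an RLE-style index decoder.
struct IndexSourceSketch {
    indices: Vec<usize>,
    position: usize,
}

impl IndexSourceSketch {
    /// Fills `buf` with the remaining indices; returns 0 once exhausted.
    fn get_batch(&mut self, buf: &mut [usize]) -> usize {
        let n = buf.len().min(self.indices.len() - self.position);
        buf[..n].copy_from_slice(&self.indices[self.position..self.position + n]);
        self.position += n;
        n
    }
}

/// Dictionary path: with an over-counted `claimed_num_values`, the loop only
/// terminates correctly because `get_batch` reports 0 at the end of its data.
fn read_dictionary_sketch<'a>(
    dict: &[&'a str],
    source: &mut IndexSourceSketch,
    claimed_num_values: usize,
) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut buf = [0usize; 4];
    while out.len() < claimed_num_values {
        let want = buf.len().min(claimed_num_values - out.len());
        let got = source.get_batch(&mut buf[..want]);
        if got == 0 {
            break;
        }
        out.extend(buf[..got].iter().map(|&i| dict[i]));
    }
    out
}
```

In this sketch, a page encoded from two strings but decoded with a claimed `num_values` of 5 still yields exactly two values on the plain path. On the dictionary path the same holds only as long as the index source never hands back indices it does not actually have, which is the assumption the missing test would need to check.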
