yordan-pavlov commented on issue #1111: URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003446768
I have been able to reproduce the issue where `RleDecoder` returns more keys than values (as explained by @tustvold above) by adding a test very similar to the existing `test_arrow_array_reader_string`, but using dictionary encoding instead of plain. Next, I will look for a short-term fix.

Here is some sample output from the test:

running 1 test
page num_values: 100, values.len(): 33
page num_values: 100, values.len(): 38
VariableLenPlainDecoder::new, num_values: 9
---------- reading a batch of 50 values ----------
VariableLenDictionaryDecoder::new, num_values: 100
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 14
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 86
**// ok so far, 33 actual values - 14 values read = 19 values still left in first page**
---------- reading a batch of 100 values ----------
VariableLenPlainDecoder::new, num_values: 10
VariableLenDictionaryDecoder::new, num_values: 100
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 86, num_values: 37
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 26, self.num_values: 0
**// this is a problem - only 19 values were left in the first page, but 26 values have been read**
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 89
thread 'arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string' panicked at 'assertion failed: `(left == right)`
  left: `"H"`,
 right: `"He"`', parquet\src\arrow\arrow_array_reader.rs:1745:17
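To make the over-read in the log easier to reason about, here is a minimal sketch of the arithmetic. It is not the arrow-rs implementation: `KeyDecoder`, `read_unclamped`, and `read_clamped` are hypothetical names, and the model assumes the surplus keys come from bit-packed runs being padded to groups of 8. That assumption is consistent with the numbers above (33 rounded up to 40, minus 14 already read, gives 26), but it has not been confirmed here.

```rust
/// Hypothetical stand-in for a dictionary-key decoder (not the arrow-rs API).
struct KeyDecoder {
    /// Keys the buffer can physically yield, including assumed padding.
    decodable: usize,
    /// Actual values still left in the page.
    values_left: usize,
}

impl KeyDecoder {
    fn new(actual_values: usize) -> Self {
        // Assumption: bit-packed runs are padded to a multiple of 8 keys.
        let decodable = (actual_values + 7) / 8 * 8;
        Self { decodable, values_left: actual_values }
    }

    /// Limits the read only by what the buffer can yield,
    /// so padding keys can leak out as if they were values.
    fn read_unclamped(&mut self, requested: usize) -> usize {
        let n = requested.min(self.decodable);
        self.decodable -= n;
        self.values_left = self.values_left.saturating_sub(n);
        n
    }

    /// Additionally limits the read by the actual values left in the page.
    fn read_clamped(&mut self, requested: usize) -> usize {
        let n = requested.min(self.values_left).min(self.decodable);
        self.decodable -= n;
        self.values_left -= n;
        n
    }
}
```

With `KeyDecoder::new(33)`, calling `read_unclamped(14)` and then `read_unclamped(37)` returns 14 and 26, matching the log, whereas the same sequence with `read_clamped` returns 14 and 19. A short-term fix along these lines would clamp each read to the number of actual values remaining in the page; whether that is the right place for the fix in the real decoder is an open question.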