yordan-pavlov commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003446768
I have been able to reproduce the issue where `RleDecoder` returns more keys
than values (as explained by @tustvold above) by adding a test very similar to
the existing `test_arrow_array_reader_string` but using dictionary encoding
instead of plain. Next, I will look for a short-term fix.
Here is some sample output from the test:
running 1 test
page num_values: 100, values.len(): 33
page num_values: 100, values.len(): 38
VariableLenPlainDecoder::new, num_values: 9
---------- reading a batch of 50 values ----------
VariableLenDictionaryDecoder::new, num_values: 100
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 14
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 86
**// ok so far, 33 actual values - 14 values read = 19 values still left in first page**
---------- reading a batch of 100 values ----------
VariableLenPlainDecoder::new, num_values: 10
VariableLenDictionaryDecoder::new, num_values: 100
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 86, num_values: 37
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 26, self.num_values: 0
**// this is a problem - only 19 values were left in the first page, but 26 values have been read**
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0
VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 100, num_values: 11
VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 89
thread 'arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string' panicked at 'assertion failed: `(left == right)`
left: `"H"`,
right: `"He"`', parquet\src\arrow\arrow_array_reader.rs:1745:17
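
One plausible reading of the numbers in the log: Parquet's RLE/bit-packing hybrid encoding bit-packs values in groups of 8, so 33 keys written as bit-packed runs would occupy 40 slots, and 14 + 26 = 40 matches the two reads from the first page. The sketch below models only that arithmetic. `KeyDecoder`, `read_unclamped` and `read_clamped` are hypothetical names, not the arrow-rs API, and the assumption that the keys were bit-packed is an inference from the log, not something confirmed here.

```rust
/// Hypothetical helper: slots occupied when `actual` keys are padded up
/// to a whole number of 8-value bit-packed groups.
fn padded_to_group_of_8(actual: usize) -> usize {
    (actual + 7) / 8 * 8
}

/// Hypothetical model of a key decoder for one page (not the arrow-rs type).
struct KeyDecoder {
    /// Keys physically present in the encoded data, padding included.
    encoded_keys_left: usize,
    /// Real (non-padding) values still left in the page.
    actual_values_left: usize,
}

impl KeyDecoder {
    fn new(actual_values: usize) -> Self {
        KeyDecoder {
            encoded_keys_left: padded_to_group_of_8(actual_values),
            actual_values_left: actual_values,
        }
    }

    /// Returns whatever is physically encoded, so padding keys leak out.
    fn read_unclamped(&mut self, requested: usize) -> usize {
        let n = requested.min(self.encoded_keys_left);
        self.encoded_keys_left -= n;
        self.actual_values_left = self.actual_values_left.saturating_sub(n);
        n
    }

    /// Caps the request at the number of real values left before reading.
    fn read_clamped(&mut self, requested: usize) -> usize {
        let n = requested.min(self.actual_values_left);
        self.read_unclamped(n)
    }
}

fn main() {
    // Unclamped: mirrors the log (14 keys, then 26 although only 19 remain).
    let mut d = KeyDecoder::new(33);
    let first = d.read_unclamped(14);
    let second = d.read_unclamped(37);
    println!("unclamped: {} then {}", first, second);

    // Clamped: the second read stops at the 19 real values.
    let mut d = KeyDecoder::new(33);
    let first = d.read_clamped(14);
    let second = d.read_clamped(37);
    println!("clamped: {} then {}", first, second);
}
```

Clamping the request to the known count of real values is only one possible direction for a short-term fix; it presumes the decoder can learn that count (33 here) rather than the page-level `num_values` of 100.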