yordan-pavlov commented on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003631709


   UPDATE: for the short-term fix, the only option I can think of is (when def 
levels are present) to count the number of actual (non-null) values here 
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L393
 before creating the value reader, and to use this count instead of num_values.
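
   A minimal sketch of the counting step, assuming the page's definition levels have already been decoded into a slice (in the actual reader they are encoded in the page buffer, so a decoding pass would be needed first). The function name `count_actual_values` and its parameters are hypothetical and not part of the arrow-rs API; the sketch relies only on the Parquet rule that a slot stores a value when its definition level equals the column's maximum definition level.

```rust
/// Counts the slots in a page that hold an actual (non-null) value.
///
/// Hypothetical helper, not an arrow-rs function. Assumes `def_levels`
/// holds the already-decoded definition levels for the page and that
/// `max_def_level` is the maximum definition level of the column.
pub fn count_actual_values(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        // A value is physically stored only where the definition level
        // reaches the column's maximum definition level.
        .filter(|&&level| level == max_def_level)
        .count()
}
```

   For an illustrative page with definition levels `[1, 0, 1, 1, 0, 1, 0, 1]` and a maximum definition level of 1, the page-level num_values would be 8, while `count_actual_values` returns 5, which is the number the value reader should be given.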
   
   This then makes the new test (using dictionary-encoded pages) pass. Notice 
how, in the test output below, the value of num_values in the 
`VariableLenDictionaryDecoder` is the actual number of values rather than a 
count that includes null values:
   
   running 1 test
   page num_values: 100, values.len(): 25
   page num_values: 100, values.len(): 31
   VariableLenPlainDecoder::new, num_values: 10
   ---------- reading a batch of 50 values ----------
   VariableLenDictionaryDecoder::new, num_values: 25
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 25, num_values: 11
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 11, self.num_values: 14
   ---------- reading a batch of 100 values ----------
   VariableLenPlainDecoder::new, num_values: 10
   VariableLenDictionaryDecoder::new, num_values: 31
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 31
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 0, num_values: 17
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 0, self.num_values: 0
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 31, num_values: 17
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 17, self.num_values: 14
   ---------- reading a batch of 100 values ----------
   VariableLenDictionaryDecoder::read_value_bytes - begin, self.num_values: 14, num_values: 14
   VariableLenDictionaryDecoder::read_value_bytes - end, values_read: 14, self.num_values: 0
   test arrow::arrow_array_reader::tests::test_arrow_array_reader_dict_string ... ok
   
   test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 471 filtered out; finished in 0.01s
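
   The counters in the output above can be followed with a toy model. This is only a sketch of the bookkeeping, assuming each read hands out at most the number of values the decoder has left; `ToyDecoder` is a hypothetical stand-in and not the real `VariableLenDictionaryDecoder`.

```rust
/// Toy stand-in for a page value decoder (hypothetical, not arrow-rs code).
/// It tracks only how many values are left in the page.
pub struct ToyDecoder {
    pub remaining: usize,
}

impl ToyDecoder {
    /// `num_values` should be the number of actual values in the page,
    /// not the page-level count that also includes nulls.
    pub fn new(num_values: usize) -> Self {
        Self { remaining: num_values }
    }

    /// Reads up to `requested` values, never more than remain,
    /// and returns how many were read.
    pub fn read(&mut self, requested: usize) -> usize {
        let values_read = requested.min(self.remaining);
        self.remaining -= values_read;
        values_read
    }
}
```

   Creating the model with 25 values and requesting 11, then 31, then 17 gives 11, 14 and 0 values read, which matches the first page in the output; a second instance created with 31 values then serves the outstanding 17 and is left with 14.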
   
   
   Tomorrow I will check the impact on performance and possibly create a 
pull request for the new test plus the short-term fix.

