tustvold edited a comment on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003119362


   Adding a print statement to `VariableLenDictionaryDecoder::new` shows that 
it is being created twice with `num_values` of 3, i.e. the number of rows in 
the row group. 
   
   This is the `num_values` field from `Page`, which confusingly is the number 
of values **including** nulls. This value is then used to determine how many 
values to read from `RleDecoder` for this page. 
   
   Now a somewhat strange quirk of the [hybrid 
encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
 is that bit-packed "runs" are **always** multiples of 8 in length. This means 
that if the final run of a page is bit-packed, as opposed to RLE, it will be 
zero-padded up to a multiple of 8. Unfortunately the parquet designers opted 
not to store the actual length of a packed run, but the length divided by 8. 
This means the true length of the final packed run of a page cannot be 
recovered from the encoded data alone...
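
As a rough illustration of why the length is lost, here is a minimal sketch of how a run header in the hybrid encoding is interpreted, following the parquet-format encoding description. The `Run` type and the function names are made up for this example and are not the `parquet` crate's API; the header is assumed to be already varint-decoded.

```rust
/// One run of the RLE / bit-packed hybrid encoding, as described by its header.
/// (Illustrative type, not part of the `parquet` crate.)
#[derive(Debug, PartialEq)]
enum Run {
    /// `len` repetitions of a single value.
    Rle { len: u64 },
    /// `groups` groups of 8 bit-packed values each.
    BitPacked { groups: u64 },
}

/// Interprets an already varint-decoded run header:
/// the lowest bit selects the run type, the remaining bits carry the count.
fn parse_run_header(header: u64) -> Run {
    if header & 1 == 1 {
        Run::BitPacked { groups: header >> 1 }
    } else {
        Run::Rle { len: header >> 1 }
    }
}

/// Number of value slots the run occupies, including any zero padding.
fn encoded_len(run: &Run) -> u64 {
    match run {
        Run::Rle { len } => *len,
        Run::BitPacked { groups } => groups * 8,
    }
}

fn main() {
    // Assume a page holding only 2 real keys, written as one bit-packed run:
    // header = (1 group << 1) | 1 = 3.
    let run = parse_run_header(3);
    assert_eq!(run, Run::BitPacked { groups: 1 });
    // The header only says "1 group", so a reader sees 8 slots and cannot
    // tell that 6 of them are padding.
    assert_eq!(encoded_len(&run), 8);
    println!("{:?} occupies {} slots", run, encoded_len(&run));
}
```

An RLE run, by contrast, stores its exact length (`header >> 1`), so only a trailing bit-packed run is ambiguous.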
   
   This is where the issue arises. `VariableLenDictionaryDecoder` thinks it has 
more actual values than it does, because it is being fed the `value_count` for 
the page, which counts nulls even though nulls are not encoded. This means it 
asks `RleDecoder` for more keys than are actually present. As `RleDecoder` 
contains a zero-padded final run, it hands back padding as if it were real 
keys, which has the effect of "shifting" the string values in the final result.
   
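The effect can be reproduced with a toy model. This is only a sketch under assumed data (a two-entry dictionary and a page with 3 rows, one of them null); `decode_strings` is a hypothetical helper standing in for the key decoding and dictionary lookup, not the actual reader code.

```rust
/// Hypothetical stand-in for "read `requested` keys from the final
/// bit-packed run and look them up in the dictionary".
fn decode_strings<'a>(
    dictionary: &[&'a str],
    padded_run: &[usize],
    requested: usize,
) -> Vec<&'a str> {
    padded_run
        .iter()
        .take(requested)
        .map(|&key| dictionary[key])
        .collect()
}

fn main() {
    let dictionary = ["a", "b"];
    // Assumed page: 3 rows, of which 1 is null, so only 2 keys were written.
    // The bit-packed run is zero-padded to 8 slots.
    let padded_run = [1, 1, 0, 0, 0, 0, 0, 0];

    // Asking for `num_values` (3, including the null) reads one padding slot,
    // which decodes to key 0 and yields a spurious "a".
    let too_many = decode_strings(&dictionary, &padded_run, 3);
    assert_eq!(too_many, ["b", "b", "a"]);

    // Asking only for the non-null count gives the intended values.
    let correct = decode_strings(&dictionary, &padded_run, 2);
    assert_eq!(correct, ["b", "b"]);
}
```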
   The fix should be a case of making whatever calls 
`ValueDecoder::read_value_bytes` only request a number of values that the page 
should be expected to yield. This is what `ColumnReaderImpl` 
[handles](https://github.com/apache/arrow-rs/blob/master/parquet/src/column/reader.rs#L208)
 for the non-`ArrowArrayReader` `ArrayReader` implementations. I need to do 
some digging to see how feasible this is with the design of `ArrowArrayReader`.
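
For reference, the number of values a page should actually yield can be derived from its definition levels: a slot holds a real value only when its definition level equals the column's maximum definition level. The sketch below shows that counting step in isolation; `values_to_read` is a name chosen for this example, and how this would be wired into `ArrowArrayReader` is left open.

```rust
/// Counts the slots that carry an actual (non-null) value, i.e. those whose
/// definition level equals the column's maximum definition level.
fn values_to_read(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level == max_def_level)
        .count()
}

fn main() {
    // Assumed flat optional column (max definition level 1) with rows
    // [value, null, value]: 3 slots in the page, but only 2 encoded keys.
    let def_levels = [1, 0, 1];
    assert_eq!(values_to_read(&def_levels, 1), 2);
}
```

Requesting this count, rather than the page's `num_values`, keeps the key decoder from reading into the zero padding of the final run.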


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
