[GitHub] [arrow-rs] tustvold edited a comment on issue #1111: ArrowArrayReader Incorrect Data

GitBox Thu, 30 Dec 2021 09:34:56 -0800


tustvold edited a comment on issue #1111:
URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003119362

Adding a print statement to `VariableLenDictionaryDecoder::new` shows that it is
being created twice with `num_values` of 3, i.e. the number of rows in the row
group.

This is the `num_values` field from `Page`, which confusingly is the number
of values **including** nulls. This value is then used to determine how many
values to read from `RleDecoder` for this page.

Now a somewhat strange quirk of the [hybrid
encoding](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
is that packed "runs" are **always** multiples of 8 in length. This means if the
final run of a page is packed encoded, as opposed to RLE, it will be zero-padded
to a multiple of 8. Unfortunately the parquet designers opted to not store the
actual length for a packed run, but the length / 8. This means the length of the
final packed run of a page is not actually knowable from the run itself...
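
To make the header quirk concrete, here is a minimal sketch of how a run header is interpreted once its varint has been decoded. The names `Run`, `decode_header` and `bit_packed_header` are made up for illustration and are not the `RleDecoder` API; the bit layout follows the Encodings.md grammar linked above.

```rust
/// The two kinds of run in the RLE / bit-packed hybrid encoding.
#[derive(Debug, PartialEq)]
enum Run {
    /// A repeated value: the exact repeat count is stored.
    Rle { len: u64 },
    /// Bit-packed values: only the number of groups of 8 is stored.
    BitPacked { groups: u64 },
}

/// Interpret an already varint-decoded run header (hypothetical helper).
/// The lowest bit selects the run type, the remaining bits hold the count.
fn decode_header(header: u64) -> Run {
    if header & 1 == 1 {
        Run::BitPacked { groups: header >> 1 }
    } else {
        Run::Rle { len: header >> 1 }
    }
}

/// Header a writer would emit for a packed run holding `real_values` values
/// (hypothetical helper). The count is rounded up to whole groups of 8.
fn bit_packed_header(real_values: u64) -> u64 {
    let groups = (real_values + 7) / 8;
    (groups << 1) | 1
}
```

A packed run with 2 real values and one with 8 real values produce the same header, so a reader cannot recover the true count from the run alone.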

This is where the issue arises. `VariableLenDictionaryDecoder` thinks it has
more actual values than it does, as it is being fed the `value_count` for the
page, which counts nulls even though nulls aren't encoded. This means it asks
`RleDecoder` for more keys than are actually present. As `RleDecoder` contains a
zero-padded final run, it returns the padding as if it were real keys, which has
the effect of "shifting" the string values in the final result.

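The shift can be reproduced with a toy model. This is only a sketch: `packed_run`, `read_keys` and `decode_page` are hypothetical stand-ins, not the real `RleDecoder` or `VariableLenDictionaryDecoder`, and the dictionary and keys are made-up data.

```rust
/// Toy packed run: the real keys followed by zero padding up to a multiple of 8.
fn packed_run(keys: &[u32]) -> Vec<u32> {
    let padded_len = (keys.len() + 7) / 8 * 8;
    let mut run = keys.to_vec();
    run.resize(padded_len, 0);
    run
}

/// Toy decoder: returns as many keys as requested (up to the padded length).
/// It has no way to tell padding from real keys.
fn read_keys(run: &[u32], requested: usize) -> Vec<u32> {
    run.iter().copied().take(requested).collect()
}

/// Decode one page by looking each key up in the dictionary.
fn decode_page<'a>(dict: &[&'a str], keys: &[u32], requested: usize) -> Vec<&'a str> {
    read_keys(&packed_run(keys), requested)
        .into_iter()
        .map(|k| dict[k as usize])
        .collect()
}
```

For a page with rows `["c", null, "b"]` and dictionary `["a", "b", "c"]`, only the keys `[2, 1]` are encoded. Requesting 3 values (the row count) yields `["c", "b", "a"]`, where the trailing `"a"` comes from a padding zero; requesting 2 yields the correct `["c", "b"]`.
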
The fix should be a case of making whatever calls
`ValueDecoder::read_value_bytes` only request as many values as the page
is expected to yield. This is what `ColumnReaderImpl`
[handles](https://github.com/apache/arrow-rs/blob/master/parquet/src/column/reader.rs#L208)
for the non-`ArrowArrayReader` `ArrayReader` implementations. I need to do
some digging to see how feasible this is with the design of `ArrowArrayReader`.
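
As a rough sketch of the idea, assuming the definition levels for the page are already decoded, the number of values to request is the number of levels equal to the column's maximum definition level. `values_to_read` is a hypothetical helper, not an existing function in the crate.

```rust
/// Count the values actually encoded in a page (hypothetical helper).
/// A slot holds a real value only when its definition level equals the
/// column's maximum definition level; anything lower is a null.
fn values_to_read(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level == max_def_level)
        .count()
}
```

For a nullable column with definition levels `[1, 0, 1]` and a maximum level of 1, this gives 2 rather than the page's `num_values` of 3.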

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
