[https://issues.apache.org/jira/browse/ARROW-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545332#comment-17545332]
Benedikt edited comment on ARROW-13024 at 6/2/22 8:22 AM:
----------------------------------------------------------
I've just encountered the same issue (on Arrow 7.0.0). Starting from
[~romankarlstetter]'s comment I've browsed the {{ByteStreamSplitDecoder}} code
a bit, and I think the suggestion above points at the right place but is not a
complete fix.
The relevant exception was added in [https://github.com/apache/arrow/pull/6781]
in order to fix an out-of-bounds buffer read found by fuzzing (this one
apparently: [https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=21453]).
However, it seems to me that this fix wasn't quite right in the first place. My
understanding is the following:
In {{SetData}}, the number of {{null}} values interleaved with the data is
not known, but {{num_values}} includes them. Thus, {{num_values}} cannot be
validated against {{len}} in this function yet. Any validation must happen in
the {{Decode}} and {{DecodeArrow}} methods (in fact, {{Decode}} is safe
because it limits the number of values it returns, and {{DecodeArrow}} performs
precisely this verification).
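To make the false positive concrete, here is a toy calculation (the numbers are made up, not taken from the fuzzer case): a {{FLOAT}} page with 100 logical values of which 40 are {{null}} physically stores only 60 byte-stream-split encoded values, so a check of the form {{num_values * sizeof(T) > len}} fires even though the page is perfectly valid.
{code:cpp}
#include <cstdint>
#include <cstdio>

// Toy numbers only: 100 logical values, 40 of them null, FLOAT (4 bytes each).
int main() {
  const int64_t num_values = 100;   // logical values, nulls included
  const int64_t num_encoded = 60;   // values physically stored in the page
  const int64_t len = num_encoded * static_cast<int64_t>(sizeof(float));  // 240
  // A check along the lines of the one added in PR 6781:
  if (num_values * static_cast<int64_t>(sizeof(float)) > len) {
    std::puts("Data size too small for number of values (corrupted file?)");
  }
  return 0;
}
{code}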
As suggested above, {{num_values_in_buffer_}} is not initialized correctly,
since {{num_values}} means something different in the presence of {{null}}s;
it should be set, as [~romankarlstetter] argues, so that it actually has the
meaning its name suggests.
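A minimal sketch of that initialization (the surrounding struct is just a stand-in for illustration, not Arrow's real {{ByteStreamSplitDecoder}}; only the member name {{num_values_in_buffer_}} and the {{SetData}} parameters are taken from the actual code):
{code:cpp}
#include <cstdint>

// Stand-in for the decoder state; only the SetData bookkeeping is modelled.
template <typename T>
struct MockByteStreamSplitState {
  const uint8_t* data_ = nullptr;
  int64_t num_values_in_buffer_ = 0;

  void SetData(int num_values, const uint8_t* data, int len) {
    (void)num_values;  // includes nulls, so not usable for sizing the buffer
    data_ = data;
    // Derive the number of physically encoded values from len instead,
    // as suggested in the issue description:
    num_values_in_buffer_ = len / static_cast<int64_t>(sizeof(T));
  }
};
{code}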
However, when {{num_values_in_buffer_}} is changed in this way, the calculation
{{const int num_decoded_previously = num_values_in_buffer_ - num_values_}}
in {{Decode}} and {{DecodeArrow}} becomes incorrect, since {{num_values_}}
includes {{null}}s but {{num_values_in_buffer_}} does not. To complete the bug
fix, it might be easiest to store {{num_decoded_previously}} explicitly in
{{ByteStreamSplitDecoder}}.
If this is done, {{Decode}} should calculate
{{values_to_decode = std::min(num_values_in_buffer_ - num_decoded_previously, max_values)}}
instead of the current
{{values_to_decode = std::min(num_values_, max_values)}}.
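Putting both pieces together, a self-contained sketch of the bookkeeping I have in mind (again only a mock for illustration; the member name {{num_decoded_previously_}} and the helper names are mine, and the actual byte-stream-split unpacking is omitted):
{code:cpp}
#include <algorithm>
#include <cstdint>

template <typename T>
struct MockDecoderBookkeeping {
  const uint8_t* data_ = nullptr;
  int64_t num_values_in_buffer_ = 0;    // physically encoded (non-null) values
  int64_t num_decoded_previously_ = 0;  // stored explicitly, not derived from num_values_

  void SetData(int /*num_values*/, const uint8_t* data, int len) {
    data_ = data;
    num_values_in_buffer_ = len / static_cast<int64_t>(sizeof(T));
    num_decoded_previously_ = 0;
  }

  // How many values the next Decode(max_values) call may produce: clamp against
  // what is still left in the buffer rather than against num_values_.
  int64_t ValuesToDecode(int64_t max_values) const {
    return std::min(num_values_in_buffer_ - num_decoded_previously_, max_values);
  }

  void MarkDecoded(int64_t n) { num_decoded_previously_ += n; }
};
{code}
With the toy numbers from above (60 encoded values), two successive calls to {{ValuesToDecode(50)}} would yield 50 and then 10, instead of overshooting the end of the buffer.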
I can't really tell from the bug report whether such a fix already accounts
for the issues found by the fuzzer.
> [C++][Parquet] Decoding byte stream split encoded columns fails when it has
> null values
> ---------------------------------------------------------------------------------------
>
> Key: ARROW-13024
> URL: https://issues.apache.org/jira/browse/ARROW-13024
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet
> Affects Versions: 2.0.0, 3.0.0, 4.0.0
> Reporter: Roman Karlstetter
> Priority: Major
>
> Reading from a parquet file fails with the following error
> {{Data size too small for number of values (corrupted file?)}}.
> This happens for the case when there is a {{BYTE_STREAM_SPLIT}}-encoded
> column which has fewer values stored than the number of rows, which is the case
> when the column has null values (definition levels are present).
> The problematic part is the condition checked in
> {{ByteStreamSplitDecoder<DType>::SetData}}, which raises the error if the
> number of values does not match the size of the data array.
> I'm unsure whether I have enough experience with the internals of the
> encoding/decoding part of this implementation to fix this issue, but my
> suggestion would be to initialize {{num_values_in_buffer_}} with
> {{len/static_cast<int64_t>(sizeof(T))}}.