[ https://issues.apache.org/jira/browse/ARROW-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545332#comment-17545332 ]

Benedikt edited comment on ARROW-13024 at 6/2/22 8:22 AM:
----------------------------------------------------------

I've just encountered the same issue (on Arrow 7.0.0). Starting from 
[~romankarlstetter]'s comment, I browsed the {{ByteStreamSplitDecoder}} code 
a bit, and I think the suggestion above points at the right place but is not a 
complete fix.

The relevant exception was added in [https://github.com/apache/arrow/pull/6781] 
in order to fix an out-of-bounds buffer read found by fuzzing (this one 
apparently: [https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=21453]). 
However, it seems to me that this fix wasn't quite right in the first place. My 
understanding is the following:

In {{SetData}}, the number of {{null}} values interleaved with the data is 
not known, but {{num_values}} includes them. Thus, it is not yet possible to 
validate {{num_values}} against {{len}} in this function. Any validation 
must happen in the {{Decode}} and {{DecodeArrow}} methods (in fact, {{Decode}} 
is safe because it limits the number of values returned, and {{DecodeArrow}} 
performs precisely this verification).
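
For context, the check in question has roughly the following shape (a 
simplified sketch based on the description in this ticket, not the exact 
Arrow source):

{code:cpp}
// Sketch of ByteStreamSplitDecoder<DType>::SetData as discussed here
// (simplified; based on this ticket, not the exact Arrow code).
// num_values counts nulls as well, but nulls are not stored in the
// byte-stream-split buffer, so for columns containing nulls this check
// rejects perfectly valid data.
void SetData(int num_values, const uint8_t* data, int len) override {
  if (static_cast<int64_t>(num_values) * static_cast<int64_t>(sizeof(T)) > len) {
    throw ParquetException("Data size too small for number of values (corrupted file?)");
  }
  num_values_in_buffer_ = num_values;  // counts nulls, contrary to its name
  // ...
}
{code}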

As suggested above, {{num_values_in_buffer_}} is not correctly initialized, 
because the meaning of {{num_values}} changes in the presence of {{null}}s; it 
should be set, as [~romankarlstetter] argues, so that it actually has the 
meaning its name suggests.
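
Concretely, that would mean deriving the count from the buffer length instead 
of from {{num_values}} (again just a sketch of the suggestion inside 
{{SetData}}, not a tested patch):

{code:cpp}
// Only non-null values are actually stored in the byte-stream-split buffer,
// so the number of values in the buffer follows from the buffer length.
num_values_in_buffer_ = len / static_cast<int64_t>(sizeof(T));
{code}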

However, when changing {{num_values_in_buffer_}} in this way, the calculation 
of {{const int num_decoded_previously = num_values_in_buffer_ - num_values_}} 
in {{Decode}} and {{DecodeArrow}} becomes incorrect, since {{num_values_}} 
includes {{null}}s but {{num_values_in_buffer_}} does not. To complete the bug 
fix, it might be easiest to store {{num_decoded_previously}} explicitly in 
{{ByteStreamSplitDecoder}}.

If this is done, {{Decode}} should calculate

{{values_to_decode = std::min(num_values_in_buffer_ - num_decoded_previously, max_values)}}

instead of the current

{{values_to_decode = std::min(num_values_, max_values)}}.
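
A minimal sketch of how that could fit together inside 
{{ByteStreamSplitDecoder}}, using the names from this discussion (an 
illustration of the idea, not a tested patch):

{code:cpp}
// Hypothetical sketch: track the decoded count in an explicit member
// instead of deriving it from num_values_, which also counts nulls.
int num_decoded_previously_ = 0;  // proposed new member

int Decode(T* buffer, int max_values) override {
  const int values_to_decode = static_cast<int>(std::min<int64_t>(
      num_values_in_buffer_ - num_decoded_previously_, max_values));
  // ... byte-stream-split decode of values_to_decode values, reading the
  //     streams starting at offset num_decoded_previously_ ...
  num_decoded_previously_ += values_to_decode;
  num_values_ -= values_to_decode;  // existing bookkeeping (counts nulls too)
  return values_to_decode;
}
{code}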

I can't really tell from the fuzzer's bug report whether this proposed fix 
already accounts for the issues it found.


> [C++][Parquet] Decoding byte stream split encoded columns fails when it has 
> null values
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-13024
>                 URL: https://issues.apache.org/jira/browse/ARROW-13024
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>    Affects Versions: 2.0.0, 3.0.0, 4.0.0
>            Reporter: Roman Karlstetter
>            Priority: Major
>
> Reading from a parquet file fails with the following error
> {{Data size too small for number of values (corrupted file?)}}.
> This happens when there is a {{BYTE_STREAM_SPLIT}}-encoded column that has 
> fewer values stored than the number of rows, which is the case when the 
> column has null values (definition levels are present).
> The problematic part is the condition checked in 
> {{ByteStreamSplitDecoder<DType>::SetData}}, which raises the error if the 
> number of values does not match the size of the data array.
> I'm unsure whether I have enough experience with the internals of the 
> encoding/decoding part of this implementation to fix this issue, but my 
> suggestion would be to initialize {{num_values_in_buffer_}} with 
> {{len/static_cast<int64_t>(sizeof(T))}}.


