[
https://issues.apache.org/jira/browse/ARROW-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052221#comment-17052221
]
Antoine Pitrou commented on ARROW-8006:
---------------------------------------
[~belzilep] Would you be able to test the attached Pull Request?
> [C++] Unsafe arrow dictionary recovered from parquet
> ----------------------------------------------------
>
> Key: ARROW-8006
> URL: https://issues.apache.org/jira/browse/ARROW-8006
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 0.15.1
> Reporter: Pierre Belzile
> Assignee: Antoine Pitrou
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When an arrow dictionary of values=strings and indices=intx is written to
> parquet and recovered, the indices that correspond to null positions are not
> written. This causes two problems:
> * when transposing the dictionary, the code encounters indices that are out
> of bounds with the existing dictionary. This does cause crashes.
> * a potential security risk because it's unclear whether bytes can be read
> back inadvertently.
> I traced using GDB and found that:
> # My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the
> valid bit is unset, that function increments "out" but does not set it. I
> think it should write a 0.
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396]
> # The recovered data "out" array is written to the dictionary builder using
> AppendIndices, which copies the memory in bulk without checking for
> nulls. Hence we end up with the indices buffer holding the "out" from above.
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670]
> When transpose runs on this
> ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406]),
> it may attempt to access memory out of bounds.
> While it would be possible to fix "transpose" and other functions that
> process dictionary indices (e.g. compare for sorting), it seems safer to
> initialize to 0. Also that's the default behavior for the arrow dict builder
> when appending one or more nulls.
> Incidentally the code recovers the dict with indices int32 instead of the
> original int8 but I guess that this is covered by another activity.
>
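For illustration only, the behaviour proposed in the description (write a 0 into each null slot, so that a later transposition never sees an uninitialized index) might look roughly like the sketch below. This is not the Arrow code: DecodeSpacedZeroFill and TransposeIndices are hypothetical stand-ins for RleDecoder::GetBatchSpaced and the transpose routine, and the std::vector-based signatures are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a "spaced" decode. `decoded` holds one index per
// non-null slot; `valid[i]` tells whether output slot i is non-null.
// Null slots are explicitly set to 0 instead of being skipped.
std::vector<int32_t> DecodeSpacedZeroFill(const std::vector<int32_t>& decoded,
                                          const std::vector<bool>& valid) {
  std::vector<int32_t> out(valid.size());
  std::size_t next = 0;  // position of the next unread value in `decoded`
  for (std::size_t i = 0; i < valid.size(); ++i) {
    if (valid[i] && next < decoded.size()) {
      out[i] = decoded[next++];
    } else {
      out[i] = 0;  // null slot: always a defined, in-bounds index
    }
  }
  return out;
}

// Hypothetical transposition: maps every index through `transpose_map`.
// The lookup is only safe if every index, including those sitting in null
// slots, is within the bounds of the map.
std::vector<int32_t> TransposeIndices(const std::vector<int32_t>& indices,
                                      const std::vector<int32_t>& transpose_map) {
  std::vector<int32_t> result;
  result.reserve(indices.size());
  for (int32_t idx : indices) {
    assert(idx >= 0 && static_cast<std::size_t>(idx) < transpose_map.size());
    result.push_back(transpose_map[static_cast<std::size_t>(idx)]);
  }
  return result;
}
```

With a non-empty dictionary, index 0 is always in bounds, so the zero-filled null slots pass through the transposition without touching memory outside the map. The values stored in those slots carry no meaning; the validity bitmap still decides which positions are null.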