Pierre Belzile created ARROW-8006:
-------------------------------------
Summary: unsafe arrow dictionary recovered from parquet
Key: ARROW-8006
URL: https://issues.apache.org/jira/browse/ARROW-8006
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.15.1
Reporter: Pierre Belzile
When an arrow dictionary of values=strings and indices=intx is written to
parquet and recovered, the indices that correspond to null positions are not
written. This causes two problems:
* when transposing the dictionary, the code encounters indices that are out of
bounds with the existing dictionary. This does cause crashes.
* a potential security risk because it's unclear whether bytes can be read
back inadvertently.
I traced using GDB and found that:
# My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the
valid bit is unset, that function increments "out" but does not set it. I think
it should write a 0.
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396]
# The recovered data "out" array is written to the dictionary builder using an
AppendIndices which moves the memory without as a bulk move without checking
for nulls. Hence we end-up with the indices buffer holding the "out" from
above.
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670
|https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670]When
transpose runs on this
([https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406]),
it may attempt to access memory out of bounds.
While is would be possible to fix "transpose" and other functions that process
dictionary indices (e.g. compare for sorting), it seems safer to initialize to
0. Also that's the default behavior for the arrow dict builder when appending
one or more nulls.
Incidentally the code recovers the dict with indices int32 instead of the
original int8 but I guess that this is covered by another activity.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)