[ 
https://issues.apache.org/jira/browse/ARROW-8006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052221#comment-17052221
 ] 

Antoine Pitrou commented on ARROW-8006:
---------------------------------------

[~belzilep] Would you be able to test the attached Pull Request?

> [C++] Unsafe arrow dictionary recovered from parquet
> ----------------------------------------------------
>
>                 Key: ARROW-8006
>                 URL: https://issues.apache.org/jira/browse/ARROW-8006
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 0.15.1
>            Reporter: Pierre Belzile
>            Assignee: Antoine Pitrou
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When an arrow dictionary of values=strings and indices=intx is written to 
> parquet and recovered, the indices that correspond to null positions are not 
> written. This causes two problems:
>  * when transposing the dictionary, the code encounters indices that are out 
> of bounds with the existing dictionary. This does cause crashes.
>  * a potential security risk because it's unclear whether bytes can be read 
> back inadvertently.
> I traced using GDB and found that:
>  # My dictionary indices were decoded by RleDecoder::GetBatchSpaced. When the 
> valid bit is unset, that function increments "out" but does not set it. I 
> think it should write a 0. 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/rle_encoding.h#L396]
>  # The recovered data "out" array is written to the dictionary builder using 
> an AppendIndices which moves the memory as a bulk move without checking for 
> nulls. Hence we end up with the indices buffer holding the "out" from above. 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1670]
> When transpose runs on this 
> ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util.cc#L406]),
> it may attempt to access memory out of bounds.
> While it would be possible to fix "transpose" and other functions that 
> process dictionary indices (e.g. compare for sorting), it seems safer to 
> initialize to 0. Also that's the default behavior for the arrow dict builder 
> when appending one or more nulls.
> Incidentally the code recovers the dict with indices int32 instead of the 
> original int8 but I guess that this is covered by another activity.
>  
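
For readers following the description above, here is a minimal sketch of the proposed behavior, assuming a much simplified decoder. The names DecodeSpacedZeroFill and TransposeIndices are hypothetical and are not Arrow's RleDecoder::GetBatchSpaced or its transpose routine; the sketch only illustrates why writing 0 into null slots keeps a validity-unaware consumer in bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Arrow's implementation): expands densely packed
// dictionary indices into one slot per row. `valid[i]` tells whether row i
// holds a value. `out` must point to at least valid.size() elements and may
// contain arbitrary bytes on entry.
void DecodeSpacedZeroFill(const std::vector<int32_t>& dense,
                          const std::vector<bool>& valid, int32_t* out) {
  std::size_t next = 0;  // position of the next unread dense value
  for (std::size_t i = 0; i < valid.size(); ++i) {
    if (valid[i]) {
      assert(next < dense.size());
      out[i] = dense[next++];
    } else {
      // Write a defined index instead of skipping the slot, so the buffer
      // never exposes whatever bytes were there before.
      out[i] = 0;
    }
  }
}

// Hypothetical transpose: remaps every index through `transpose_map` without
// consulting validity, so indices at null positions must be in bounds too.
std::vector<int32_t> TransposeIndices(const std::vector<int32_t>& indices,
                                      const std::vector<int32_t>& transpose_map) {
  std::vector<int32_t> out;
  out.reserve(indices.size());
  for (int32_t idx : indices) {
    // .at() throws std::out_of_range instead of reading stray memory.
    out.push_back(transpose_map.at(static_cast<std::size_t>(idx)));
  }
  return out;
}
```

With a non-empty dictionary, index 0 is always valid, so a bulk copy of the decoded buffer followed by a transpose stays within the dictionary even for null rows. If the null slots were left unwritten, the same transpose would operate on whatever values the buffer happened to hold.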



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
