[ https://issues.apache.org/jira/browse/ARROW-2462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444152#comment-16444152 ]

ASF GitHub Bot commented on ARROW-2462:
---------------------------------------

pitrou commented on issue #1896: ARROW-2462: [C++] Fix Segfault in 
UnpackBinaryDictionary
URL: https://github.com/apache/arrow/pull/1896#issuecomment-382764445
 
 
   @zeroshade Perhaps you are not using the right clang version. You need 
`clang-format-5.0` installed for `make format` to produce output exactly 
compatible with the CI expectations.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Segfault when writing a parquet table containing a dictionary column 
> from Record Batch Stream
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2462
>                 URL: https://issues.apache.org/jira/browse/ARROW-2462
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.9.1
>            Reporter: Matt Topol
>            Priority: Major
>              Labels: pull-request-available
>
> Discovered this while using pyarrow with RecordBatch streams and Parquet. 
> The issue can be replicated as follows:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> # create record batch with 1 dictionary column
> indices = pa.array([1, 0, 1, 1, 0])
> dictionary = pa.array(['Foo', 'Bar'])
> dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
> rb = pa.RecordBatch.from_arrays([dict_array], ['d0'])
> # write out using RecordBatchStreamWriter
> sink = pa.BufferOutputStream()
> writer = pa.RecordBatchStreamWriter(sink, rb.schema)
> writer.write_batch(rb)
> writer.close()
> buf = sink.get_result()
> # read in and try to write parquet table
> reader = pa.open_stream(buf)
> tbl = reader.read_all()
> pq.write_table(tbl, 'dict_table.parquet') # SEGFAULTS
> {code}
> When writing record batch streams, if an array contains no nulls, Arrow 
> writes a placeholder nullptr instead of a full bitmap of 1s. When that 
> stream is deserialized, the null bitmap is therefore not populated and is 
> left as a nullptr. When you then attempt to write the table via 
> pyarrow.parquet, you end up 
> [here|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L963] 
> in the Parquet writer code, which attempts to cast the dictionary to a 
> non-dictionary representation. Since the null count isn't checked before 
> creating a BitmapReader, the BitmapReader is constructed with a nullptr for 
> the bitmap_data but a non-zero length, and it segfaults in the constructor 
> [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit-util.h#L415] 
> because the bitmap is null.
> A simple check of the null count before constructing the BitmapReader 
> avoids the segfault.
> Already filed [PR 1896|https://github.com/apache/arrow/pull/1896]
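
The fix described in the quoted issue amounts to guarding the BitmapReader 
construction on the null count. Below is a minimal, self-contained sketch of 
that pattern; the names ValidityReader and ConvertDictionaryColumn are 
hypothetical stand-ins for Arrow's BitmapReader and the parquet-cpp writer 
path, not the actual API.

{code:cpp}
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for arrow::internal::BitmapReader: it assumes
// `bitmap` points at `length` valid bits, which is exactly what breaks when
// Arrow ships a nullptr validity buffer for an all-valid array.
class ValidityReader {
 public:
  ValidityReader(const uint8_t* bitmap, int64_t length)
      : bitmap_(bitmap), length_(length) {}

  bool IsSet(int64_t i) const {
    return i < length_ && ((bitmap_[i / 8] >> (i % 8)) & 1);
  }

 private:
  const uint8_t* bitmap_;
  int64_t length_;
};

// The guard described in the issue: only read the validity bitmap when the
// array actually has nulls; an array with null_count == 0 is all valid.
void ConvertDictionaryColumn(const uint8_t* valid_bits, int64_t length,
                             int64_t null_count) {
  if (null_count == 0 || valid_bits == nullptr) {
    std::printf("all %lld values valid, bitmap not consulted\n",
                static_cast<long long>(length));
    return;
  }
  ValidityReader reader(valid_bits, length);
  for (int64_t i = 0; i < length; ++i) {
    std::printf("value %lld is %s\n", static_cast<long long>(i),
                reader.IsSet(i) ? "valid" : "null");
  }
}

int main() {
  // The situation from the report: an all-valid dictionary column whose
  // validity buffer was serialized as a placeholder nullptr.
  ConvertDictionaryColumn(/*valid_bits=*/nullptr, /*length=*/5,
                          /*null_count=*/0);
  return 0;
}
{code}

In the real code the guard sits in the writer's dictionary-conversion path, 
which is what PR 1896 changes.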



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
