Matt Topol created ARROW-2462:

             Summary: [C++] Segfault when writing a parquet table containing a 
dictionary column from Record Batch Stream
                 Key: ARROW-2462
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.9.1
            Reporter: Matt Topol

Discovered this while using pyarrow with RecordBatch streams and Parquet. The 
issue can be reproduced as follows:

import pyarrow as pa
import pyarrow.parquet as pq

# create record batch with 1 dictionary column
indices = pa.array([1,0,1,1,0])
dictionary = pa.array(['Foo', 'Bar'])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
rb = pa.RecordBatch.from_arrays([dict_array], ['d0'])

# write out using RecordBatchStreamWriter
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, rb.schema)
writer.write_batch(rb)
writer.close()
buf = sink.get_result()

# read in and try to write parquet table
reader = pa.open_stream(buf)
tbl = reader.read_all()
pq.write_table(tbl, 'dict_table.parquet') # SEGFAULTS

When writing record batch streams, if an array contains no nulls, Arrow writes 
a nullptr placeholder instead of a full bitmap of 1s. When that stream is 
deserialized, the null bitmap is therefore not populated and is left as a 
nullptr. When this table is then written via pyarrow.parquet, control reaches 
the parquet writer code that attempts to cast the dictionary column to a 
non-dictionary representation. Because the null count isn't checked before 
creating a BitmapReader, the BitmapReader is constructed with a nullptr for 
bitmap_data but a non-zero length, and it segfaults in the constructor because 
the bitmap is null.

So a simple check of the null count before constructing the BitmapReader avoids 
the segfault.

Already filed PR 1896.
