[jira] [Created] (ARROW-2462) [C++] Segfault when writing a parquet table containing a dictionary column from Record Batch Stream

Matt Topol (JIRA) Sun, 15 Apr 2018 12:32:13 -0700

Matt Topol created ARROW-2462:
---------------------------------

             Summary: [C++] Segfault when writing a parquet table containing a 
dictionary column from Record Batch Stream
                 Key: ARROW-2462
                 URL: https://issues.apache.org/jira/browse/ARROW-2462
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.9.1
            Reporter: Matt Topol



I discovered this while using pyarrow with RecordBatch streams and Parquet. 
The issue can be reproduced as follows:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# create record batch with 1 dictionary column
indices = pa.array([1,0,1,1,0])
dictionary = pa.array(['Foo', 'Bar'])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
rb = pa.RecordBatch.from_arrays( [ dict_array ], [ 'd0' ] )

# write out using RecordBatchStreamWriter
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, rb.schema)
writer.write_batch(rb)
writer.close()
buf = sink.get_result()

# read in and try to write parquet table
reader = pa.open_stream(buf)
tbl = reader.read_all()
pq.write_table(tbl, 'dict_table.parquet') # SEGFAULTS
{code}

When writing a record batch stream, Arrow does not serialize a validity bitmap 
for an array that has no nulls; it writes a placeholder nullptr instead of a 
full bitmap of 1s. When that stream is deserialized, the null bitmap is not 
populated and remains a nullptr. Writing the resulting table via 
pyarrow.parquet then ends up 
[here|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L963]
 in the parquet writer code, which attempts to Cast the dictionary to a 
non-dictionary representation. Because the null count isn't checked before 
creating a BitmapReader, the BitmapReader is constructed with a nullptr for the 
bitmap_data but a non-zero length, and it segfaults in the constructor 
[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit-util.h#L415]
 because the bitmap is null.

So a simple check of the null count before constructing the BitmapReader avoids 
the segfault.
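
To illustrate the guard, here is a minimal pure-Python sketch of the idea. It 
is not the actual C++ change in the PR: the BitmapReader class below is a toy 
stand-in that only mirrors the behaviour described above (it reads the first 
byte in its constructor), and valid_flags is a hypothetical helper name 
introduced for this example. In Python a missing bitmap (None) raises a 
TypeError where the C++ code would dereference a nullptr.

```python
class BitmapReader:
    """Toy stand-in for a validity-bitmap reader (least significant bit first)."""

    def __init__(self, bitmap, start_offset, length):
        self._bitmap = bitmap
        self._position = 0
        self._length = length
        self._byte_offset = start_offset // 8
        self._bit_offset = start_offset % 8
        self._current_byte = 0
        if length > 0:
            # Eager read of the first byte: with a missing bitmap
            # (None here, nullptr in C++) this is where things go wrong.
            self._current_byte = bitmap[self._byte_offset]

    def is_set(self):
        return (self._current_byte >> self._bit_offset) & 1 == 1

    def next(self):
        self._bit_offset += 1
        self._position += 1
        if self._bit_offset == 8:
            self._bit_offset = 0
            self._byte_offset += 1
            if self._position < self._length:
                self._current_byte = self._bitmap[self._byte_offset]


def valid_flags(bitmap, null_count, offset, length):
    """Return one bool per slot; hypothetical helper showing the guard."""
    if null_count == 0:
        # No nulls: every slot is valid, so the bitmap is never touched
        # and may legitimately be absent.
        return [True] * length
    reader = BitmapReader(bitmap, offset, length)
    flags = []
    for _ in range(length):
        flags.append(reader.is_set())
        reader.next()
    return flags
```

With this guard, valid_flags(None, 0, 0, 5) returns five True values without 
ever constructing a reader, while an array that does contain nulls still goes 
through the bitmap as before.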

Already filed [PR 1896|https://github.com/apache/arrow/pull/1896]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
