Matt Topol created ARROW-2462:
---------------------------------
Summary: [C++] Segfault when writing a parquet table containing a
dictionary column from Record Batch Stream
Key: ARROW-2462
URL: https://issues.apache.org/jira/browse/ARROW-2462
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 0.9.1
Reporter: Matt Topol
I discovered this while using pyarrow with RecordBatch streams and Parquet.
The issue can be reproduced as follows:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
# create a record batch with one dictionary column
indices = pa.array([1, 0, 1, 1, 0])
dictionary = pa.array(['Foo', 'Bar'])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
rb = pa.RecordBatch.from_arrays([dict_array], ['d0'])
# write out using RecordBatchStreamWriter
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, rb.schema)
writer.write_batch(rb)
writer.close()
buf = sink.get_result()
# read in and try to write parquet table
reader = pa.open_stream(buf)
tbl = reader.read_all()
pq.write_table(tbl, 'dict_table.parquet') # SEGFAULTS
{code}
When Arrow writes a record batch stream, an array with no nulls is serialized
with a placeholder nullptr in place of a full validity bitmap of 1s. When that
stream is deserialized, the null bitmap is not populated and remains a nullptr.
Writing the resulting table via pyarrow.parquet then reaches
[this point|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L963]
in the parquet writer code, which attempts to Cast the dictionary to a
non-dictionary representation. Because the null count is not checked before
the BitmapReader is created, the BitmapReader is constructed with a nullptr for
bitmap_data but a non-zero length, and it segfaults in the constructor
[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit-util.h#L415]
because the bitmap is null.
Checking the null count before constructing the BitmapReader avoids the
segfault.
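To illustrate the guard described above, here is a minimal pure-Python sketch. It is not the Arrow or parquet-cpp code: the BitmapReader class and the valid_flags helper are hypothetical stand-ins, and the ValueError only models the point where the C++ constructor would dereference a null pointer. It assumes Arrow's least-significant-bit-first validity bitmap layout.

```python
from typing import Iterator, List, Optional


class BitmapReader:
    """Toy stand-in for a validity-bitmap reader (hypothetical, not Arrow's class)."""

    def __init__(self, bitmap: Optional[bytes], length: int) -> None:
        if length > 0 and bitmap is None:
            # Models the crash: the C++ constructor would read through a
            # null pointer here when the length is non-zero.
            raise ValueError("bitmap is missing but length is non-zero")
        self._bitmap = bitmap
        self._length = length

    def __iter__(self) -> Iterator[bool]:
        for i in range(self._length):
            # Bit i lives in byte i // 8, at bit position i % 8 (LSB first).
            yield bool((self._bitmap[i // 8] >> (i % 8)) & 1)


def valid_flags(bitmap: Optional[bytes], length: int, null_count: int) -> List[bool]:
    """Return one validity flag per slot, guarding on the null count first."""
    if null_count == 0:
        # No nulls: the bitmap may legitimately be absent, so never read it.
        return [True] * length
    return list(BitmapReader(bitmap, length))


# An array with no nulls and no bitmap is handled without touching the reader.
print(valid_flags(None, 5, 0))
# An array with nulls still goes through the bitmap.
print(valid_flags(bytes([0b00010110]), 5, 2))
```

The point of the sketch is only the ordering: the null count is consulted before any reader is built, so a missing bitmap is never dereferenced.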
A fix has already been filed as [PR 1896|https://github.com/apache/arrow/pull/1896].