[
https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Suvayu Ali updated ARROW-3792:
------------------------------
Description:
h2. Background
I am trying to convert a very sparse dataset to Parquet (~3% of rows in a range
are populated). The file I am working with spans up to ~63M rows. I decided to
iterate in batches of 500k rows, 127 batches in total. Each row batch is a
{{RecordBatch}}. I create 4 batches at a time and write them to a Parquet file
incrementally. Something like this:
{code:python}
batches = [..] # 4 batches
tbl = pa.Table.from_batches(batches)
pqwriter.write_table(tbl, row_group_size=15000)
# same issue with pq.write_table(..)
{code}
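For reference, here is a self-contained sketch of that write pattern with
synthetic data; the schema, column names, output path, and batch sizes are
placeholders, not the real genotype columns, and with synthetic data like this
the crash does not reproduce (see below).
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder schema standing in for the real VCF-derived columns.
schema = pa.schema([("pos", pa.int64()), ("ref", pa.string())])

def make_batch(n):
    # Build one RecordBatch with n rows of dummy data.
    return pa.RecordBatch.from_arrays(
        [pa.array(list(range(n)), type=pa.int64()),
         pa.array(["A"] * n, type=pa.string())],
        ["pos", "ref"])

pqwriter = pq.ParquetWriter("example.parquet", schema)
# Write 4 batches at a time; the second group mirrors the sizes seen in the
# iteration that crashed on the real data ([0, 0, 2876, 14423]).
for sizes in ([14050, 16398, 14080, 14920], [0, 0, 2876, 14423]):
    batches = [make_batch(n) for n in sizes]
    tbl = pa.Table.from_batches(batches)
    pqwriter.write_table(tbl, row_group_size=15000)
pqwriter.close()
{code}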
I was getting a segmentation fault at the final step, which I narrowed down to
a specific iteration. That iteration contained empty batches; the batch sizes
were [0, 0, 2876, 14423]. The number of rows in each {{RecordBatch}} for the
whole dataset is shown below:
{code:python}
[14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
18807, 18789, 14258, 0, 0]
{code}
When the empty {{RecordBatch}}es are excluded, the segfault goes away, but
unfortunately I couldn't create a proper minimal example with synthetic data.
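Excluding them amounts to something like the following (the exact filtering
code is illustrative; {{batches}} and {{pqwriter}} are the same objects as in
the snippet above):
{code:python}
# Skip zero-row batches before building the table.
non_empty = [b for b in batches if b.num_rows > 0]
if non_empty:  # nothing to write if every batch in this group was empty
    tbl = pa.Table.from_batches(non_empty)
    pqwriter.write_table(tbl, row_group_size=15000)
{code}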
h2. Not quite minimal example
The data I am using is from the 1000 Genomes Project, which has been public
for many years, so we can be reasonably sure the data is good. The following
steps should help you replicate the issue.
# Download the data file (and index), about 330MB:
{code:bash}
$ wget
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
{code}
# Install {{pysam}}, a thin Cython wrapper around the reference implementation
of the VCF file spec. You will need the {{zlib}} headers, but that's probably
not a problem :)
{code:bash}
$ pip3 install --user pysam
{code}
# Now you can use the attached script ({{pq-bug.py}}) to replicate the crash;
a rough sketch of the kind of conversion loop involved follows below.
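To give an idea without opening the attachment, a much-simplified loop along
these lines would do the job. This is illustrative only, not {{pq-bug.py}}
verbatim: the column choice ({{pos}}, {{ref}}), the contig name "20", and the
500k-position windows are assumptions based on the description above.
Iterating over fixed genomic windows is what produces the occasional empty
{{RecordBatch}} in sparse regions.
{code:python}
# Illustrative sketch only -- not the attached pq-bug.py. Reads VCF records
# with pysam and turns each 500k-position window into one RecordBatch.
import pysam
import pyarrow as pa

vcf = pysam.VariantFile(
    "ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")

def to_batch(rows):
    # rows: list of (pos, ref) tuples; may be empty for sparse windows.
    positions = pa.array([r[0] for r in rows], type=pa.int64())
    refs = pa.array([r[1] for r in rows], type=pa.string())
    return pa.RecordBatch.from_arrays([positions, refs], ["pos", "ref"])

batches = []
for start in range(0, 63_500_000, 500_000):  # 127 windows of 500k positions
    window = vcf.fetch("20", start, start + 500_000)
    batches.append(to_batch([(rec.pos, rec.ref) for rec in window]))
{code}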
h2. Extra information
I tried attaching gdb; the backtrace at the point of the segfault is shown
below (maybe it helps; this is how I realised that the empty batches could be
the cause).
{code}
(gdb) bt
#0 0x00007f3e7676d670 in
parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6>
>::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*)
()
from
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#1 0x00007f3e76733d1e in arrow::Status parquet::arrow::(anonymous
namespace)::ArrowColumnWriter::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>,
arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
from
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#2 0x00007f3e7673a3d4 in parquet::arrow::(anonymous
namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
from
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#3 0x00007f3e7673df09 in
parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray>
const&, long, long) ()
from
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#4 0x00007f3e7673c74d in
parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr<arrow::ChunkedArray>
const&, long, long) ()
from
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#5 0x00007f3e7673c8d2 in parquet::arrow::FileWriter::WriteTable(arrow::Table
const&, long) ()
from
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#6 0x00007f3e731e3a51 in
__pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*,
_object*) ()
from
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so
{code}
> [PARQUET] Segmentation fault when writing empty RecordBatches
> -------------------------------------------------------------
>
> Key: ARROW-3792
> URL: https://issues.apache.org/jira/browse/ARROW-3792
> Project: Apache Arrow
> Issue Type: Bug
> Components: Format
> Affects Versions: 0.11.1
> Environment: Fedora 28, pyarrow installed with pip
> Fedora 29, pyarrow installed from conda-forge
> Reporter: Suvayu Ali
> Priority: Major
> Labels: parquetWriter
> Attachments: pq-bug.py
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)