[ https://issues.apache.org/jira/browse/ARROW-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924566#comment-16924566 ]
Wes McKinney commented on ARROW-3933:
-------------------------------------
As step one, I have it returning a reasonable error message instead of
segfaulting:
{code}
pyarrow.lib.ArrowIOError: Parquet struct decoding error. Expected to decode
1777 definition levels from child field "bytes: binary not null" in parent "gs:
struct<decompLen: int32, bytes: binary not null> not null" but was only able to
decode 0
In ../src/parquet/arrow/reader.cc, line 561, code:
GetDefLevels(&def_levels_data, &def_levels_length)
In ../src/parquet/arrow/reader.cc, line 659, code:
DefLevelsToNullArray(&null_bitmap, &null_count)
In ../src/parquet/arrow/reader.cc, line 795, code: final_status
{code}
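For callers, a minimal sketch of what this change means in practice: the failure can now be caught in Python rather than killing the interpreter. The helper name try_read_parquet and its reader parameter are hypothetical, and the sketch assumes the error raised by pyarrow is an IOError subclass (OSError on Python 3), which is how ArrowIOError is understood to be defined in this release.

```python
def try_read_parquet(path, reader=None):
    """Return (table, None) on success or (None, message) on an I/O error.

    `reader` is a hypothetical hook so the logic can be exercised without
    pyarrow installed; by default it falls back to pyarrow.parquet.read_table.
    """
    if reader is None:
        # Imported lazily so the helper itself has no hard dependency.
        import pyarrow.parquet as pq
        reader = pq.read_table
    try:
        return reader(path), None
    except OSError as exc:
        # Assumption: the struct decoding error surfaces as an IOError/OSError
        # subclass instead of a segfault.
        return None, str(exc)
```

With the file from this issue, the second element of the returned tuple would be expected to carry the "Parquet struct decoding error" text shown above.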
Note that it won't be possible to read this file anyway, because it contains
repeated structs, which are not yet supported (see ARROW-1644):
https://gist.github.com/wesm/fefdfc74bd5acffb92a6cbd3ec6e3c20
> [Python] Segfault reading Parquet files from GNOMAD
> ---------------------------------------------------
>
> Key: ARROW-3933
> URL: https://issues.apache.org/jira/browse/ARROW-3933
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Environment: Ubuntu 18.04 or Mac OS X
> Reporter: David Konerding
> Assignee: Wes McKinney
> Priority: Minor
> Labels: parquet
> Fix For: 0.15.0
>
> Attachments:
> part-r-00000-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet
>
>
> I am getting a segfault when running a basic program on an Ubuntu 18.04 VM (AWS).
> The error also occurs out of the box on Mac OS X.
> $ sudo snap install --classic google-cloud-sdk
> $ gsutil cp gs://gnomad-public/release/2.0.2/vds/exomes/gnomad.exomes.r2.0.2.sites.vds/rdd.parquet/part-r-00000-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet .
> $ conda install pyarrow
> $ python test.py
> Segmentation fault (core dumped)
> test.py:
> import pyarrow.parquet as pq
> path = "part-r-00000-31fcf9bd-682f-4c20-bbe5-b0bd08699104.snappy.parquet"
> pq.read_table(path)
> gdb output:
> Thread 3 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffdf199700 (LWP 13703)]
> 0x00007fffdfc2a470 in parquet::arrow::StructImpl::GetDefLevels(short const**,
> unsigned long*) () from
> /home/ubuntu/miniconda2/lib/python2.7/site-packages/pyarrow/../../../libparquet.so.11
> I also tested fastparquet, and it reads the file just fine.