[
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071805#comment-16071805
]
Wes McKinney commented on ARROW-1167:
-------------------------------------
OK, I believe the root cause is that one of the columns in this dataset has
over 2GB of string data in it, which causes an undetected overflow of the
int32 value offsets in the underlying {{BinaryArray}} object. So there are
several things that need to happen:
* Detecting int32 overflow in {{BinaryBuilder}} (so that constructing a
malformed {{BinaryArray}} like this isn't possible)
* Making sure such overflows are raised properly out of {{Table.from_pandas}}
* Providing for chunked table construction in {{Table.from_pandas}} (which
will let you work around this problem)
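The first bullet can be illustrated without Arrow: a {{BinaryArray}} stores each value's end position in an int32 offsets buffer, so once a column's cumulative string data passes 2^31 - 1 bytes, appending one more value wraps the offset and yields a malformed array. Below is a toy sketch of the overflow check being proposed (pure Python, names hypothetical; the real fix would live in Arrow's C++ {{BinaryBuilder}}):

```python
INT32_MAX = 2**31 - 1  # largest end offset an int32 offsets buffer can hold


class SafeBinaryBuilder:
    """Toy builder mirroring Arrow's BinaryArray layout (offsets + data),
    but raising on int32 offset overflow instead of silently wrapping."""

    def __init__(self):
        self.offsets = [0]        # value end offsets, as in BinaryArray
        self.data = bytearray()   # concatenated value bytes

    def append(self, value: bytes):
        new_end = self.offsets[-1] + len(value)
        if new_end > INT32_MAX:
            # The check ARROW-1167 asks for: fail loudly rather than
            # produce a malformed array that crashes downstream writers.
            raise OverflowError(
                "BinaryArray capacity exceeded: %d > INT32_MAX" % new_end)
        self.data += value
        self.offsets.append(new_end)


builder = SafeBinaryBuilder()
builder.append(b"hello")
# builder.offsets is now [0, 5]
```

With a check like this, the segfault in the quoted stack trace would instead surface as a Python-level exception out of {{Table.from_pandas}}.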
cc [~xhochy]
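Until chunked construction exists in {{Table.from_pandas}}, a caller-side workaround is to partition the rows so that no column's accumulated string data exceeds the int32 offset range within any one chunk (e.g. writing one Parquet row group per chunk). A rough sketch of such a partitioner ({{chunk_rows}} is a hypothetical helper, not an Arrow API):

```python
INT32_MAX = 2**31 - 1  # byte capacity of an int32-offset BinaryArray


def chunk_rows(values, limit=INT32_MAX):
    """Partition a sequence of byte strings into consecutive chunks whose
    total payload stays at or below `limit` bytes, so that each chunk can
    be built with int32 offsets."""
    chunks, current, size = [], [], 0
    for v in values:
        if current and size + len(v) > limit:
            chunks.append(current)   # flush the full chunk
            current, size = [], 0
        current.append(v)
        size += len(v)
    if current:
        chunks.append(current)
    return chunks


# Tiny limit just to show the splitting behavior:
parts = chunk_rows([b"aaaa", b"bb", b"cccc"], limit=6)
# parts == [[b"aaaa", b"bb"], [b"cccc"]]
```

The same idea applied to a DataFrame would mean slicing it row-wise, converting each slice with {{Table.from_pandas}}, and writing the slices sequentially to one Parquet file.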
> Writing pyarrow Table to Parquet core dumps
> -------------------------------------------
>
> Key: ARROW-1167
> URL: https://issues.apache.org/jira/browse/ARROW-1167
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Jeff Knupp
>
> When writing a pyarrow Table (instantiated from a Pandas dataframe reading in
> a ~5GB CSV file) to a parquet file, the interpreter cores with the following
> stack trace from gdb:
> {code}
> #0 __memmove_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:181
> #1 0x00007fbaa5c779f1 in parquet::InMemoryOutputStream::Write(unsigned char const*, long) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #2 0x00007fbaa5c0ce97 in parquet::PlainEncoder<parquet::DataType<(parquet::Type::type)6> >::Put(parquet::ByteArray const*, int) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #3 0x00007fbaa5c18855 in parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #4 0x00007fbaa5c189d5 in parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> >::WriteBatch(long, short const*, short const*, parquet::ByteArray const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #5 0x00007fbaa5be0900 in arrow::Status parquet::arrow::FileWriter::Impl::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>, arrow::BinaryType>(parquet::ColumnWriter*, std::shared_ptr<arrow::Array> const&, long, short const*, short const*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #6 0x00007fbaa5be171d in parquet::arrow::FileWriter::Impl::WriteColumnChunk(arrow::Array const&) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #7 0x00007fbaa5be1dad in parquet::arrow::FileWriter::WriteColumnChunk(arrow::Array const&) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #8 0x00007fbaa5be2047 in parquet::arrow::FileWriter::WriteTable(arrow::Table const&, long) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #9 0x00007fbaa51e1f53 in __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, _object*) () from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/_parquet.cpython-35m-x86_64-linux-gnu.so
> #10 0x00000000004e9bc7 in PyCFunction_Call () at ../Objects/methodobject.c:98
> #11 0x0000000000529885 in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7ffe6510a6c0, func=<optimized out>) at ../Python/ceval.c:4933
> #12 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a6c0) at ../Python/ceval.c:4732
> #13 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #14 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
> #15 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510a8d0, func=<optimized out>) at ../Python/ceval.c:4813
> #16 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a8d0) at ../Python/ceval.c:4730
> #17 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #18 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
> #19 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510aae0, func=<optimized out>) at ../Python/ceval.c:4813
> #20 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510aae0) at ../Python/ceval.c:4730
> #21 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #22 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510ac10, func=<optimized out>) at ../Python/ceval.c:4803
> #23 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ac10) at ../Python/ceval.c:4730
> #24 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #25 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffe6510ad40, func=<optimized out>) at ../Python/ceval.c:4803
> #26 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ad40) at ../Python/ceval.c:4730
> #27 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #28 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at ../Python/ceval.c:4018
> #29 0x000000000052dfdf in PyEval_EvalCodeEx () at ../Python/ceval.c:4039
> #30 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at ../Python/ceval.c:777
> #31 0x00000000005fd2c2 in run_mod () at ../Python/pythonrun.c:976
> #32 0x00000000005ff76a in PyRun_FileExFlags () at ../Python/pythonrun.c:929
> #33 0x00000000005ff95c in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:396
> #34 0x000000000063e7d6 in run_file (p_cf=0x7ffe6510afb0, filename=0x2161260 L"scripts/parquet_export.py", fp=0x226fde0) at ../Modules/main.c:318
> #35 Py_Main () at ../Modules/main.c:768
> #36 0x00000000004cfe41 in main () at ../Programs/python.c:65
> #37 0x00007fbadf0db830 in __libc_start_main (main=0x4cfd60 <main>, argc=2, argv=0x7ffe6510b1c8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe6510b1b8) at ../csu/libc-start.c:291
> #38 0x00000000005d5f29 in _start ()
> {code}
> This is occurring in a pretty vanilla call to {{pq.write_table(table,
> output)}}. Before the crash, I'm able to print out the table's schema, and it
> looks a little odd (all columns are explicitly specified in
> {{pandas.read_csv()}} to be strings):
> {code}
> _id: string
> ref_id: string
> ref_no: string
> stage: string
> stage2_ref_id: string
> org_id: string
> classification: string
> solicitation_no: string
> notice_type: string
> business_category: string
> procurement_mode: string
> funding_instrument: string
> funding_source: string
> approved_budget: string
> publish_date: string
> closing_date: string
> contract_duration: string
> calendar_type: string
> trade_agreement: string
> pre_bid_date: string
> pre_bid_venue: string
> procuring_entity_org_id: string
> procuring_entity_org: string
> client_agency_org_id: string
> client_agency_org: string
> contact_person: string
> contact_person_address: string
> tender_title: string
> description: string
> other_info: string
> reason: string
> created_by: string
> creation_date: string
> modified_date: string
> special_instruction: string
> collection_contact: string
> tender_status: string
> collection_point: string
> date_available: string
> serialid: string
> __index_level_0__: int64
> -- metadata --
> pandas: {"index_columns": ["__index_level_0__"], "columns": [{"pandas_type":
> "unicode", "numpy_type": "object", "metadata": null, "name": "_id"},
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name":
> "ref_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata":
> null, "name": "ref_no"}, {"pandas_type": "unicode", "numpy_type": "object",
> "metadata": null, "name": "stage"}, {"pandas_type": "mixed", "numpy_type":
> "object", "metadata": null, "name": "stage2_ref_id"}, {"pandas_type":
> "unicode", "numpy_type": "object", "metadata": null, "name": "org_id"},
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name":
> "classification"}, {"pandas_type": "mixed", "numpy_type": "object",
> "metadata": null, "name": "solicitation_no"}, {"pandas_type": "unicode",
> "numpy_type": "object", "metadata": null, "name": "notice_type"},
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name":
> "business_category"}, {"pandas_type": "unicode", "numpy_type": "object",
> "metadata": null, "name": "procurement_mode"}, {"pandas_type": "mixed",
> "numpy_type": "object", "metadata": null, "name": "funding_instrument"},
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name":
> "funding_source"}, {"pandas_type": "unicode", "numpy_type": "object",
> "metadata": null, "name": "approved_budget"}, {"pandas_type": "mixed",
> "numpy_type": "object", "metadata": null, "name": "publish_date"},
> {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name":
> "closing_date"}, {"pandas_type": "unicode", "numpy_type": "object",
> "metadata": null, "name": "contract_duration"}, {"pandas_type": "mixed",
> "numpy_type": "object", "metadata": null, "name": "calendar_type"},
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name":
> "trade_agreement"}, {"pandas_type": "mixed", "numpy_type": "object",
> "metadata": null, "name": "pre_bid_date"}, {"pandas_type": "mixed",
> "numpy_type": "object", "metadata": null, "name": "pre_bid_venue"},
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name":
> "procuring_entity_org_id"}, {"pandas_type": "unicode", "numpy_type":
> "object", "metadata": null, "name": "procuring_entity_org"}, {"pandas_type":
> "unicode", "numpy_type": "object", "metadata": null, "name":
> "client_agency_org_id"}, {"pandas_type": "mixed", "numpy_type": "object",
> "metadata": null, "name": "client_agency_org"}, {"pandas_type": "mixed",
> "numpy_type": "object", "metadata": null, "name": "contact_person"},
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name":
> "contact_person_address"}, {"pandas_type": "mixed", "numpy_type": "object",
> "metadata": null, "name": "tender_title"}, {"pandas_type": "mixed",
> "numpy_type": "object", "metadata": null, "name": "description"},
> {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name":
> "other_info"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata":
> null, "name": "reason"}, {"pandas_type": "unicode", "numpy_type": "object",
> "metadata": null, "name": "created_by"}, {"pandas_type": "unicode",
> "numpy_type": "object", "metadata": null, "name": "creation_date"},
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name":
> "modified_date"}, {"pandas_type": "mixed", "numpy_type": "object",
> "metadata": null, "name": "special_instruction"}, {"pandas_type": "mixed",
> "numpy_type": "object", "metadata": null, "name": "collection_contact"},
> {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name":
> "tender_status"}, {"pandas_type": "mixed", "numpy_type": "object",
> "metadata": null, "name": "collection_point"}, {"pandas_type": "mixed",
> "numpy_type": "object", "metadata": null, "name": "date_available"},
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name":
> "serialid"}, {"pandas_type": "int64", "numpy_type": "int64", "metadata":
> null, "name": "__index_level_0__"}], "pandas_version": "0.19.2"}
> Segmentation fault (core dumped)
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)