[jira] [Commented] (ARROW-1167) Writing pyarrow Table to Parquet core dumps

Jeff Knupp (JIRA) Thu, 06 Jul 2017 09:22:15 -0700

    [ 
https://issues.apache.org/jira/browse/ARROW-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076822#comment-16076822
 ]


Jeff Knupp commented on ARROW-1167:
-----------------------------------

So [~wesmckinn], pandas has the exact same bug (a bit easier to trigger), as 
reported here: https://github.com/pandas-dev/pandas/issues/16798. I tracked 
down the allocation that triggers the issue and, unsurprisingly, it occurs 
when growing the buffer to accommodate the size of the data. I've confirmed 
that this, too, results in an integer overflow in the size to be allocated.
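
To make the failure mode concrete, here is a minimal sketch of how a buffer 
size wraps around when it is held in a signed 32-bit integer. The helper name 
and the byte counts are hypothetical, chosen only for illustration; this is 
not code from either project.

```python
import ctypes


def as_int32(n: int) -> int:
    """Reinterpret n the way a signed 32-bit C integer would store it."""
    return ctypes.c_int32(n).value


# Hypothetical numbers: a buffer holding 2,000,000,000 bytes that must grow
# by another 500,000,000 bytes to fit the incoming data.
current_size = 2_000_000_000
incoming = 500_000_000

# The true size exceeds INT32_MAX (2**31 - 1), so the stored value wraps
# around and becomes negative.
new_size = as_int32(current_size + incoming)
print(new_size)
```

A negative (or otherwise wrapped) size handed to the allocator or to a copy 
routine is consistent with a crash inside a memmove-style call.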

Now, that's all well and good, but I'd actually like to fix all of these issues 
in the two projects. *Does it make sense to move to int64 to track buffer 
sizes*? We can still check for overflow, but this solves the underlying issue 
as well.
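
As a rough sketch of what I mean, assuming sizes are tracked as signed 64-bit 
values: the check below rejects a growth request that would not fit, and 
otherwise returns the new size. The function name and the limit constant are 
hypothetical, not taken from either codebase.

```python
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit size can hold


def checked_new_size(current: int, incoming: int) -> int:
    """Return current + incoming, refusing sizes that overflow int64."""
    if current < 0 or incoming < 0:
        raise ValueError("sizes must be non-negative")
    # Compare before adding, so the check itself cannot overflow in a
    # language with fixed-width integers.
    if incoming > INT64_MAX - current:
        raise OverflowError("buffer size would exceed int64")
    return current + incoming


# The request that wraps a 32-bit size fits comfortably in 64 bits.
print(checked_new_size(2_000_000_000, 500_000_000))
```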

Let me know what you think.

> Writing pyarrow Table to Parquet core dumps
> -------------------------------------------
>
>                 Key: ARROW-1167
>                 URL: https://issues.apache.org/jira/browse/ARROW-1167
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Jeff Knupp
>
> When writing a pyarrow Table (instantiated from a Pandas dataframe reading in 
> a ~5GB CSV file) to a parquet file, the interpreter cores with the following 
> stack trace from gdb:
> {code}
> #0  __memmove_avx_unaligned () at 
> ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:181
> #1  0x00007fbaa5c779f1 in parquet::InMemoryOutputStream::Write(unsigned char 
> const*, long) () from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #2  0x00007fbaa5c0ce97 in 
> parquet::PlainEncoder<parquet::DataType<(parquet::Type::type)6> 
> >::Put(parquet::ByteArray const*, int) ()
>    from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #3  0x00007fbaa5c18855 in 
> parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> 
> >::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray 
> const*) ()
>    from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #4  0x00007fbaa5c189d5 in 
> parquet::TypedColumnWriter<parquet::DataType<(parquet::Type::type)6> 
> >::WriteBatch(long, short const*, short const*, parquet::ByteArray const*) ()
>    from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #5  0x00007fbaa5be0900 in arrow::Status 
> parquet::arrow::FileWriter::Impl::TypedWriteBatch<parquet::DataType<(parquet::Type::type)6>,
>  arrow::BinaryType>(parquet::ColumnWriter*, std::shared_ptr<arrow::Array> 
> const&, long, short const*, short const*) () from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #6  0x00007fbaa5be171d in 
> parquet::arrow::FileWriter::Impl::WriteColumnChunk(arrow::Array const&) () 
> from /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #7  0x00007fbaa5be1dad in 
> parquet::arrow::FileWriter::WriteColumnChunk(arrow::Array const&) () from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #8  0x00007fbaa5be2047 in parquet::arrow::FileWriter::WriteTable(arrow::Table 
> const&, long) () from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/libparquet.so.1
> #9  0x00007fbaa51e1f53 in 
> __pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, 
> _object*) ()
>    from 
> /home/ubuntu/.local/lib/python3.5/site-packages/pyarrow/_parquet.cpython-35m-x86_64-linux-gnu.so
> #10 0x00000000004e9bc7 in PyCFunction_Call () at ../Objects/methodobject.c:98
> #11 0x0000000000529885 in do_call (nk=<optimized out>, na=<optimized out>, 
> pp_stack=0x7ffe6510a6c0, func=<optimized out>) at ../Python/ceval.c:4933
> #12 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a6c0) at 
> ../Python/ceval.c:4732
> #13 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #14 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at 
> ../Python/ceval.c:4018
> #15 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized 
> out>, n=<optimized out>, pp_stack=0x7ffe6510a8d0, func=<optimized out>) at 
> ../Python/ceval.c:4813
> #16 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510a8d0) at 
> ../Python/ceval.c:4730
> #17 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #18 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at 
> ../Python/ceval.c:4018
> #19 0x0000000000528eee in fast_function (nk=<optimized out>, na=<optimized 
> out>, n=<optimized out>, pp_stack=0x7ffe6510aae0, func=<optimized out>) at 
> ../Python/ceval.c:4813
> #20 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510aae0) at 
> ../Python/ceval.c:4730
> #21 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #22 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized 
> out>, n=<optimized out>, pp_stack=0x7ffe6510ac10, func=<optimized out>) at 
> ../Python/ceval.c:4803
> #23 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ac10) at 
> ../Python/ceval.c:4730
> #24 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #25 0x0000000000528814 in fast_function (nk=<optimized out>, na=<optimized 
> out>, n=<optimized out>, pp_stack=0x7ffe6510ad40, func=<optimized out>) at 
> ../Python/ceval.c:4803
> #26 call_function (oparg=<optimized out>, pp_stack=0x7ffe6510ad40) at 
> ../Python/ceval.c:4730
> #27 PyEval_EvalFrameEx () at ../Python/ceval.c:3236
> #28 0x000000000052d2e3 in _PyEval_EvalCodeWithName () at 
> ../Python/ceval.c:4018
> #29 0x000000000052dfdf in PyEval_EvalCodeEx () at ../Python/ceval.c:4039
> #30 PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, 
> locals=<optimized out>) at ../Python/ceval.c:777
> #31 0x00000000005fd2c2 in run_mod () at ../Python/pythonrun.c:976
> #32 0x00000000005ff76a in PyRun_FileExFlags () at ../Python/pythonrun.c:929
> #33 0x00000000005ff95c in PyRun_SimpleFileExFlags () at 
> ../Python/pythonrun.c:396
> #34 0x000000000063e7d6 in run_file (p_cf=0x7ffe6510afb0, filename=0x2161260 
> L"scripts/parquet_export.py", fp=0x226fde0) at ../Modules/main.c:318
> #35 Py_Main () at ../Modules/main.c:768
> #36 0x00000000004cfe41 in main () at ../Programs/python.c:65
> #37 0x00007fbadf0db830 in __libc_start_main (main=0x4cfd60 <main>, argc=2, 
> argv=0x7ffe6510b1c8, init=<optimized out>, fini=<optimized out>, 
> rtld_fini=<optimized out>, stack_end=0x7ffe6510b1b8)
>     at ../csu/libc-start.c:291
> #38 0x00000000005d5f29 in _start ()
> {code}
> This is occurring in a pretty vanilla call to 
> {{pq.write_table(table, output)}}. Before the crash, I'm able to print out 
> the table's schema and it looks a little odd (all columns are explicitly 
> specified in {{pandas.read_csv()}} to be strings):
> {code}
> _id: string
> ref_id: string
> ref_no: string
> stage: string
> stage2_ref_id: string
> org_id: string
> classification: string
> solicitation_no: string
> notice_type: string
> business_category: string
> procurement_mode: string
> funding_instrument: string
> funding_source: string
> approved_budget: string
> publish_date: string
> closing_date: string
> contract_duration: string
> calendar_type: string
> trade_agreement: string
> pre_bid_date: string
> pre_bid_venue: string
> procuring_entity_org_id: string
> procuring_entity_org: string
> client_agency_org_id: string
> client_agency_org: string
> contact_person: string
> contact_person_address: string
> tender_title: string
> description: string
> other_info: string
> reason: string
> created_by: string
> creation_date: string
> modified_date: string
> special_instruction: string
> collection_contact: string
> tender_status: string
> collection_point: string
> date_available: string
> serialid: string
> __index_level_0__: int64
> -- metadata --
> pandas: {"index_columns": ["__index_level_0__"], "columns": [{"pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null, "name": "_id"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "ref_id"}, {"pandas_type": "unicode", "numpy_type": "object", "metadata": 
> null, "name": "ref_no"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "stage"}, {"pandas_type": "mixed", "numpy_type": 
> "object", "metadata": null, "name": "stage2_ref_id"}, {"pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null, "name": "org_id"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "classification"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "solicitation_no"}, {"pandas_type": "unicode", 
> "numpy_type": "object", "metadata": null, "name": "notice_type"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "business_category"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "procurement_mode"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "funding_instrument"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "funding_source"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "approved_budget"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "publish_date"}, 
> {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
> "closing_date"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "contract_duration"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "calendar_type"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "trade_agreement"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "pre_bid_date"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "pre_bid_venue"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "procuring_entity_org_id"}, {"pandas_type": "unicode", "numpy_type": 
> "object", "metadata": null, "name": "procuring_entity_org"}, {"pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null, "name": 
> "client_agency_org_id"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "client_agency_org"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "contact_person"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "contact_person_address"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "tender_title"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "description"}, 
> {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
> "other_info"}, {"pandas_type": "mixed", "numpy_type": "object", "metadata": 
> null, "name": "reason"}, {"pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null, "name": "created_by"}, {"pandas_type": "unicode", 
> "numpy_type": "object", "metadata": null, "name": "creation_date"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "modified_date"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "special_instruction"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "collection_contact"}, 
> {"pandas_type": "mixed", "numpy_type": "object", "metadata": null, "name": 
> "tender_status"}, {"pandas_type": "mixed", "numpy_type": "object", 
> "metadata": null, "name": "collection_point"}, {"pandas_type": "mixed", 
> "numpy_type": "object", "metadata": null, "name": "date_available"}, 
> {"pandas_type": "unicode", "numpy_type": "object", "metadata": null, "name": 
> "serialid"}, {"pandas_type": "int64", "numpy_type": "int64", "metadata": 
> null, "name": "__index_level_0__"}], "pandas_version": "0.19.2"}
> Segmentation fault (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
