[
https://issues.apache.org/jira/browse/ARROW-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282516#comment-16282516
]
ASF GitHub Bot commented on ARROW-1873:
---------------------------------------
wesm opened a new pull request #1404: ARROW-1873: [Python] Catch more possible
Python/OOM errors in to_pandas conversion path
URL: https://github.com/apache/arrow/pull/1404
I also ran into a gnarly method-dispatching bug (ARROW-1904) while working on
this. I will address that deprecation in a separate patch.
> [Python] Segmentation fault when loading total 2GB of parquet files
> -------------------------------------------------------------------
>
> Key: ARROW-1873
> URL: https://issues.apache.org/jira/browse/ARROW-1873
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: DB Tsai
> Assignee: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
>
> We are trying to load 100 parquet files, each around 20MB.
> Until we port [ARROW-1830] into our pyarrow distribution, we use {{glob}} to
> list all the files and then load them as a pandas DataFrame through pyarrow.
> The schema of the parquet files is as follows:
> {code:java}
> root
> |-- dateint: integer (nullable = true)
> |-- profileid: long (nullable = true)
> |-- time: long (nullable = true)
> |-- label: double (nullable = true)
> |-- weight: double (nullable = true)
> |-- features: array (nullable = true)
> | |-- element: double (containsNull = true)
> {code}
> If we load only a couple of them, it works without any issue. However, when
> loading all 100 of them, we get the segmentation fault shown below. FYI, if we
> flatten {{features: array[double]}} into top-level columns, the file sizes are
> about the same, and loading works fine too.
> Is there anything we can try to eliminate this issue? Thanks.
> {code}
> >>> import glob
> >>> import pyarrow.parquet as pq
> >>> files = glob.glob("/home/dbt/data/*")
> >>> data = pq.ParquetDataset(files).read().to_pandas()
> [New Thread 0x7fffe8f84700 (LWP 23769)]
> [New Thread 0x7fffe3b93700 (LWP 23770)]
> [New Thread 0x7fffe3392700 (LWP 23771)]
> [New Thread 0x7fffe2b91700 (LWP 23772)]
> [Thread 0x7fffe2b91700 (LWP 23772) exited]
> [Thread 0x7fffe3b93700 (LWP 23770) exited]
> Thread 4 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3392700 (LWP 23771)]
> 0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> (gdb) backtrace
> #0 0x00007ffff270fc94 in arrow::Status arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, arrow::py::ArrowDeserializer*) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #1 0x00007ffff2700b5a in arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object*, _object**) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #2 0x00007ffff2714985 in arrow::Status arrow::py::ConvertListsLike<arrow::DoubleType>(arrow::py::PandasOptions, std::shared_ptr<arrow::Column> const&, _object**) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #3 0x00007ffff2716b92 in arrow::py::ObjectBlock::Write(std::shared_ptr<arrow::Column> const&, long, long) ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #4 0x00007ffff270a489 in arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int) const ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #5 0x00007ffff270a67c in std::thread::_Impl<std::_Bind_simple<arrow::Status arrow::ParallelFor<arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&>(int, int, arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1} ()> >::_M_run() ()
>    from /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #6 0x00007ffff1e30c5c in std::execute_native_thread_routine_compat (__p=<optimized out>)
>    at /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
> #7 0x00007ffff7bc16ba in start_thread (arg=0x7fffe3392700) at pthread_create.c:333
> #8 0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> {code}
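
The report notes that loading only a couple of files works while loading all 100 crashes. A minimal sketch of splitting the globbed file list into small batches, so that each batch can be read separately and the failing size narrowed down, could look like the following. The helper names `batches` and `load_in_batches` and the `read_batch` callback are hypothetical and not part of pyarrow; whether batching avoids the crash is not established by the report.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(paths: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def load_in_batches(
    paths: Sequence[str],
    read_batch: Callable[[List[str]], T],  # hypothetical callback supplied by the caller
    size: int = 10,
) -> List[T]:
    """Call `read_batch` once per batch of sorted paths and collect the results in order."""
    return [read_batch(chunk) for chunk in batches(sorted(paths), size)]
```

With pyarrow, `read_batch` could be `lambda chunk: pq.ParquetDataset(chunk).read().to_pandas()`, mirroring the call in the report, and the resulting frames could then be combined with `pandas.concat`.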