[ https://issues.apache.org/jira/browse/ARROW-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275374#comment-16275374 ]

DB Tsai commented on ARROW-1873:
--------------------------------

I did some digging on a beefy machine, and the load does run there without crashing!

We have two variants of the same data with different schemas. One is
{code:java}
root
 |-- dateint: integer (nullable = true)
 |-- profileid: long (nullable = true)
 |-- time: long (nullable = true)
 |-- label: double (nullable = true)
 |-- weight: double (nullable = true)
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = true)
{code}
as previously shown; its total size is 2.7GB.

The other one stores the features in a struct, with a schema like
{code:java}
root
 |-- dateint: integer (nullable = true)
 |-- profileid: long (nullable = true)
 |-- time: long (nullable = true)
 |-- label: double (nullable = true)
 |-- weight: double (nullable = true)
 |-- features: struct (nullable = true)
 |    |-- f1: double (nullable = true)
 |    |-- f2: double (nullable = true)
 ...........
 |    |-- f14: double (nullable = true)
{code}
and its total size is 1.9GB.
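
For anyone double-checking those numbers, here is a minimal sketch (the data path is hypothetical) that inspects the parquet footers with pyarrow and compares the compressed on-disk size against the uncompressed column data size:
{code:python}
import glob

import pyarrow.parquet as pq

compressed = uncompressed = rows = 0
for path in glob.glob("/home/dbt/data/*"):  # hypothetical location of the files
    meta = pq.ParquetFile(path).metadata
    rows += meta.num_rows
    for i in range(meta.num_row_groups):
        rg = meta.row_group(i)
        uncompressed += rg.total_byte_size  # uncompressed column data in this row group
        for j in range(rg.num_columns):
            compressed += rg.column(j).total_compressed_size

print("rows=%d compressed=%.2fGB uncompressed=%.2fGB"
      % (rows, compressed / 2**30, uncompressed / 2**30))
{code}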

It does make sense that the struct files are smaller, since the flat columnar 
layout compresses better. 
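
To make that concrete, here is a sketch (with made-up data, and assuming a current pyarrow where the {{pa.table}} factory exists) that writes the same 14 feature values both as a list column and as a struct column, so the resulting file sizes can be compared:
{code:python}
import os

import pyarrow as pa
import pyarrow.parquet as pq

n, k = 100_000, 14  # made-up row and feature counts

values = [[float(i % 97 + j) for j in range(k)] for i in range(n)]

# Layout 1: features as list<double>.
as_list = pa.table({"features": pa.array(values, type=pa.list_(pa.float64()))})

# Layout 2: features as struct<f1: double, ..., f14: double>.
columns = [pa.array([row[j] for row in values]) for j in range(k)]
names = ["f%d" % (j + 1) for j in range(k)]
as_struct = pa.table({"features": pa.StructArray.from_arrays(columns, names)})

pq.write_table(as_list, "/tmp/features_list.parquet")
pq.write_table(as_struct, "/tmp/features_struct.parquet")
print(os.path.getsize("/tmp/features_list.parquet"),
      os.path.getsize("/tmp/features_struct.parquet"))
{code}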

However, when the first one is loaded into memory, it takes 22.32GB, while the 
second one takes 127.97GB. In theory, the uncompressed data should take at most 
{{93993056 * (5 + 14) * 8 bytes ≈ 14GB}}. Where is the overhead coming from? 
In particular, the struct version results in almost 10x overhead. Is it 
possible to load the data into memory with compression, like Spark's dataframe 
cache? Is it possible to load the data lazily, and only deserialize it when 
needed?
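
On the lazy-loading question, one workaround (just a sketch, not a fix for the underlying overhead) is to read and convert one row group at a time, so only a slice of the data is ever materialized as pandas objects; {{process}} here is a hypothetical per-chunk consumer:
{code:python}
import glob

import pyarrow.parquet as pq

for path in glob.glob("/home/dbt/data/*"):  # hypothetical location of the files
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        # Only this row group is deserialized to pandas at any one time.
        chunk = pf.read_row_group(i).to_pandas()
        process(chunk)  # hypothetical per-chunk consumer
{code}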

Regardless of the overhead issue, we should figure out how to raise an 
out-of-memory error instead of crashing with a segmentation fault, for easier 
debugging. 

These are a lot of questions; thank you for helping out.

> Segmentation fault when loading total 2GB of parquet files
> ----------------------------------------------------------
>
>                 Key: ARROW-1873
>                 URL: https://issues.apache.org/jira/browse/ARROW-1873
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: DB Tsai
>             Fix For: 0.8.0
>
>
> We are trying to load 100 parquet files, each of them around 20MB. 
> Before we port [ARROW-1830] into our pyarrow distribution, we use {{glob}} to 
> list all the files, and then load them as a pandas dataframe through pyarrow. 
> The schema of the parquet files is like 
> {code:java}
> root
>  |-- dateint: integer (nullable = true)
>  |-- profileid: long (nullable = true)
>  |-- time: long (nullable = true)
>  |-- label: double (nullable = true)
>  |-- weight: double (nullable = true)
>  |-- features: array (nullable = true)
>  |    |-- element: double (containsNull = true)
> {code}
> If we only load a couple of them, it works without any issue. However, when 
> loading 100 of them, we get a segmentation fault as shown below. FYI, if we 
> flatten {{features: array[double]}} into the top level, the file sizes stay 
> around the same, and loading works fine too. 
> Is there anything we can try to eliminate this issue? Thanks.
> {code}
> >>> import glob
> >>> import pyarrow.parquet as pq
> >>> files = glob.glob("/home/dbt/data/*")
> >>> data = pq.ParquetDataset(files).read().to_pandas()
> [New Thread 0x7fffe8f84700 (LWP 23769)]
> [New Thread 0x7fffe3b93700 (LWP 23770)]
> [New Thread 0x7fffe3392700 (LWP 23771)]
> [New Thread 0x7fffe2b91700 (LWP 23772)]
> [Thread 0x7fffe2b91700 (LWP 23772) exited]
> [Thread 0x7fffe3b93700 (LWP 23770) exited]
> Thread 4 "python" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffe3392700 (LWP 23771)]
> 0x00007ffff270fc94 in arrow::Status 
> arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>    from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> (gdb) backtrace
> #0  0x00007ffff270fc94 in arrow::Status 
> arrow::VisitTypeInline<arrow::py::ArrowDeserializer>(arrow::DataType const&, 
> arrow::py::ArrowDeserializer*) ()
>    from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #1  0x00007ffff2700b5a in 
> arrow::py::ConvertColumnToPandas(arrow::py::PandasOptions, 
> std::shared_ptr<arrow::Column> const&, _object*, _object**) ()
>    from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #2  0x00007ffff2714985 in arrow::Status 
> arrow::py::ConvertListsLike<arrow::DoubleType>(arrow::py::PandasOptions, 
> std::shared_ptr<arrow::Column> const&, _object**) () from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #3  0x00007ffff2716b92 in 
> arrow::py::ObjectBlock::Write(std::shared_ptr<arrow::Column> const&, long, 
> long) ()
>    from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #4  0x00007ffff270a489 in 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}::operator()(int)
>  const ()
>    from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #5  0x00007ffff270a67c in std::thread::_Impl<std::_Bind_simple<arrow::Status 
> arrow::ParallelFor<arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&>(int,
>  int, 
> arrow::py::DataFrameBlockCreator::WriteTableToBlocks(int)::{lambda(int)#1}&)::{lambda()#1}
>  ()> >::_M_run() ()
>    from 
> /home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libarrow_python.so.0
> #6  0x00007ffff1e30c5c in std::execute_native_thread_routine_compat 
> (__p=<optimized out>)
>     at 
> /opt/conda/conda-bld/compilers_linux-64_1505664199673/work/.build/src/gcc-7.2.0/libstdc++-v3/src/c++11/thread.cc:110
> #7  0x00007ffff7bc16ba in start_thread (arg=0x7fffe3392700) at 
> pthread_create.c:333
> #8  0x00007ffff78f73dd in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> {code}


