Hi, I'm unable to process a Pandas `DataFrame` stored as a Parquet dataset
(multiple parts, `snappy` compression): Python is killed for reaching the
memory limit.
## Steps to reproduce
Sample data:
```bash
$ du -ach *
0 _SUCCESS
11M part-00000-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
10M part-00001-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M part-00002-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M part-00003-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M part-00004-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M part-00005-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M part-00006-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.7M part-00007-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.7M part-00008-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.7M part-00009-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00010-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00011-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00012-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00013-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00014-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00015-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00016-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00017-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00018-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M part-00019-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.5M part-00020-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.4M part-00021-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.4M part-00022-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.4M part-00023-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.4M part-00024-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
7.7M part-00025-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
6.4M part-00026-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
6.3M part-00027-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
261M total
```
Python code:
```python
>>> import pyarrow.parquet as pq
>>> dataset = pq.ParquetDataset('.')
>>> dataset.read_pandas()
```
After the read (the result is still a `pyarrow.Table`; no pandas conversion
yet), the Python process memory allocation reaches (in kilobytes):
```bash
$ ps -eo size,command | grep python
9064128 python
```
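For a finer-grained view, Arrow can report how many bytes its own default
memory pool currently holds, which separates Arrow-side allocations from
interpreter overhead. A minimal sketch, assuming `pyarrow.total_allocated_bytes()`
behaves the same in this version:
```python
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> dataset = pq.ParquetDataset('.')
>>> table = dataset.read_pandas()
>>> pa.total_allocated_bytes()  # bytes held by Arrow's default memory pool
```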
When the read is chained with a conversion to pandas, the process hits the
memory limit and Python is killed.
```python
>>> import pyarrow.parquet as pq
>>> dataset = pq.ParquetDataset('.')
>>> dataset.read_pandas().to_pandas()
[1] 610 killed python
```
The last memory allocation size observed for Python before it is killed:
```bash
$ while sleep 1; do ps -eo size,command | grep " python$"; done
...
20018716 python
```
So `pyarrow` consumes roughly 20 GB of memory to process a 261 MB dataset.
In comparison, `fastparquet` is able to process the same dataset with Python
consuming about 3.6 GB of memory:
```python
>>> from glob import glob
>>> from fastparquet import ParquetFile
>>> paths = glob('*.parquet')
>>> pf = ParquetFile(paths)
>>> dataframe = pf.to_pandas()
```
```bash
$ ps -eo size,command | grep " python$"
3795128 python
```
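As a possible `pyarrow`-side workaround (untested beyond this dataset, and
assuming the spike comes from materializing the whole dataset in one shot),
reading the part files one at a time and concatenating in pandas keeps only a
single Arrow table alive at any point:
```python
>>> from glob import glob
>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> # convert one part file at a time, so only one Arrow table is live at once
>>> frames = [pq.read_table(p).to_pandas() for p in sorted(glob('*.parquet'))]
>>> dataframe = pd.concat(frames, ignore_index=True)
```
`pd.concat` still makes one extra copy of the pandas data, but the peak should
stay well below the 20 GB observed above.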
## Version info
```python
>>> import sys, pyarrow; print(pyarrow.__version__, sys.version)
0.10.0 3.6.6 (default, Sep 13 2018, 16:29:18)
[GCC 8.2.1 20180831]
```