Hi, I'm unable to process a Pandas `DataFrame` stored as a Parquet dataset 
(multiple parts, `snappy` compression). The Python process is killed after 
hitting the memory limit.

## Steps to reproduce

Sample data:
```bash
$ du -ach  *
0       _SUCCESS
11M     part-00000-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
10M     part-00001-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M    part-00002-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M    part-00003-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M    part-00004-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M    part-00005-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.8M    part-00006-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.7M    part-00007-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.7M    part-00008-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.7M    part-00009-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00010-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00011-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00012-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00013-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00014-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00015-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00016-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00017-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00018-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.6M    part-00019-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.5M    part-00020-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.4M    part-00021-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.4M    part-00022-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.4M    part-00023-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
9.4M    part-00024-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
7.7M    part-00025-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
6.4M    part-00026-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
6.3M    part-00027-90655d30-aacc-469c-9c77-ba991a964c68-c000.snappy.parquet
261M    total
```

Python code:
```python
>>> import pyarrow.parquet as pq
>>> dataset = pq.ParquetDataset('.')
>>> dataset.read_pandas()
```

At this point, the Python process's memory allocation (in kilobytes) reaches:
```bash
$ ps -eo size,command | grep python
9064128 python
```
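
As a side note, a hedged way to see how much of that is held by Arrow's own 
allocator (assuming the installed build exposes `pyarrow.total_allocated_bytes()`, 
which recent releases do), as opposed to other process overhead:

```python
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> dataset = pq.ParquetDataset('.')
>>> table = dataset.read_pandas()
>>> # Bytes currently held by Arrow's default memory pool.
>>> pa.total_allocated_bytes()
```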

When the read is chained with the conversion to pandas, the process reaches 
the memory limit and Python is killed.
```python
>>> import pyarrow.parquet as pq
>>> dataset = pq.ParquetDataset('.')
>>> dataset.read_pandas().to_pandas()
[1]    610 killed     python
```
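
For what it's worth, a sketch of a per-file workaround (unverified whether it 
actually avoids the kill): read each part with `pq.read_table`, convert it to 
pandas immediately, and concatenate the frames, so only one Arrow table is 
alive at a time.

```python
>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> from glob import glob
>>> # Convert one part at a time, then concatenate the pandas frames.
>>> frames = [pq.read_table(path).to_pandas() for path in sorted(glob('*.parquet'))]
>>> dataframe = pd.concat(frames, ignore_index=True)
```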

The last memory allocation size observed for the Python process before it is 
killed:

```bash
$ while sleep 1; do ps -eo size,command | grep " python$"; done
...
20018716 python
```

So `pyarrow` consumes about 20 GB of memory to process a 261 MB (on-disk) 
dataset.

In comparison, `fastparquet` is able to process the same dataset, with Python 
consuming about 3.6 GB of memory:

```python
>>> from glob import glob
>>> from fastparquet import ParquetFile
>>> paths = glob('*.parquet')
>>> pf = ParquetFile(paths)
>>> dataframe = pf.to_pandas()
```

```bash
$ ps -eo size,command | grep " python$"
3795128 python
```
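
To put the 261 MB on-disk size in perspective, the decompressed frame itself is 
much larger in memory. This can be checked on the `dataframe` loaded by 
fastparquet above with pandas' own accounting (I have not recorded the exact 
number here; it depends on the data, especially string columns):

```python
>>> # Actual in-memory size of the decompressed DataFrame, counting
>>> # object (string) columns at their real size.
>>> dataframe.memory_usage(deep=True).sum()
```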

## Version info
```python
>>> import sys, pyarrow; print(pyarrow.__version__, sys.version)
0.10.0 3.6.6 (default, Sep 13 2018, 16:29:18)
[GCC 8.2.1 20180831]
```
