[
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597832#comment-17597832
]
Gianluca Ficarelli edited comment on ARROW-17399 at 8/30/22 12:14 PM:
----------------------------------------------------------------------
I tried with a fresh virtualenv on both Linux and Mac (Intel):
Linux (Ubuntu 20.04, 32 GB):
{code:java}
$ python -V
Python 3.9.9
$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
$ python test_pyarrow.py
0 time: 0.0 rss: 90.8
1 time: 3.0 rss: 1205.7
2 time: 4.6 rss: 1212.6
3 time: 4.8 rss: 710.0
4 time: 8.0 rss: 708.2
5 time: 14.6 rss: 16652.9
6 time: 17.6 rss: 16242.9
7 time: 17.7 rss: 15743.5
8 time: 20.7 rss: 866.2
{code}
Mac (Monterey 12.5, 16 GB):
{code:java}
$ python -V
Python 3.9.9
$ pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
$ python test_pyarrow.py
0 time: 0.0 rss: 64.0
1 time: 4.0 rss: 1075.0
2 time: 6.2 rss: 1136.6
3 time: 6.8 rss: 671.8
4 time: 9.8 rss: 671.8
5 time: 22.9 rss: 2477.4
6 time: 25.9 rss: 2423.4
7 time: 27.1 rss: 180.6
8 time: 30.1 rss: 180.6
{code}
When the same script is rerun, there is some variability on Mac at steps 5 and 6
(I observed anywhere from 1261 to 4140 MB), while on Linux the result is always
the same (around 16 GB at steps 5, 6, and 7).
So it seems that the high RSS memory usage occurs on Linux only.
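One way to narrow this down (a sketch, not verified against this issue): the Linux wheels of pyarrow default to the jemalloc memory pool, and jemalloc keeps freed pages cached for a while, so the retained RSS could be allocator caching rather than an Arrow leak. The pool backend can be inspected, and jemalloc's page decay disabled for comparison; setting {{ARROW_DEFAULT_MEMORY_POOL=system}} in the environment before importing pyarrow switches allocators entirely:
{code:python}
import pyarrow
import pyarrow.parquet

pool = pyarrow.default_memory_pool()
print("backend:", pool.backend_name)  # "jemalloc" on the Linux wheels

# Diagnostic only: with jemalloc, return freed pages to the OS
# immediately instead of keeping them cached.
if pool.backend_name == "jemalloc":
    pyarrow.jemalloc_set_decay_ms(0)

table = pyarrow.parquet.read_table("example.parquet")
df = table.to_pandas()
print("pool MB:", pool.bytes_allocated() / 2**20)

del table, df
print("pool MB after del:", pool.bytes_allocated() / 2**20)
{code}
If the high RSS disappears with the system pool or with the decay set to 0, the memory is being held by the allocator rather than leaked by pyarrow.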
> pyarrow may use a lot of memory to load a dataframe from parquet
> ----------------------------------------------------------------
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 9.0.0
> Environment: linux
> Reporter: Gianluca Ficarelli
> Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using
> {{pyarrow.parquet.read_table}}, the memory usage may grow far beyond what
> is needed to load the dataframe itself, and the memory is not freed until
> the dataframe is deleted.
> The problem is evident when the dataframe has a *column containing lists or
> numpy arrays*, while it seems absent (or not noticeable) if the column
> contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created
> with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but
> the memory usage after loading it from parquet is around 16 GB.
> The items of the column are created as numpy arrays rather than lists, to be
> consistent with the types loaded from parquet (as illustrated in the sketch
> below, pyarrow produces numpy arrays, not lists).
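> To illustrate that last point (a minimal sketch, separate from the repro
> script below): round-tripping a list column through {{to_pandas}} yields
> numpy arrays as the cell values:
> {code:python}
> import pyarrow
>
> table = pyarrow.table({"a": [[1, 2], [3, 4]]})
> df = table.to_pandas()
> print(type(df["a"][0]))  # <class 'numpy.ndarray'>
> {code}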
>
> {code:python}
> import gc
> import time
>
> import numpy as np
> import pandas as pd
> import psutil
> import pyarrow
> import pyarrow.parquet
>
>
> def pyarrow_dump(filename, df, compression="snappy"):
>     table = pyarrow.Table.from_pandas(df)
>     pyarrow.parquet.write_table(table, filename, compression=compression)
>
>
> def pyarrow_load(filename):
>     table = pyarrow.parquet.read_table(filename)
>     return table.to_pandas()
>
>
> def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
>     # the defaults are evaluated once, at definition time, so start_time
>     # is the script start and process is the current process
>     # gc.collect()
>     current_time = time.monotonic() - start_time
>     rss = process.memory_info().rss / 2 ** 20
>     print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
>
>
> if __name__ == "__main__":
>     print_mem(0)
>     rows = 5000000
>     df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
>     print_mem(1)
>
>     pyarrow_dump("example.parquet", df)
>     print_mem(2)
>
>     del df
>     print_mem(3)
>     time.sleep(3)
>     print_mem(4)
>
>     df = pyarrow_load("example.parquet")
>     print_mem(5)
>     time.sleep(3)
>     print_mem(6)
>
>     del df
>     print_mem(7)
>     time.sleep(3)
>     print_mem(8)
> {code}
> Run with memory-profiler:
> {code:bash}
> mprof run --multiprocess python test_pyarrow.py
> {code}
> Output:
> {code:java}
> mprof: Sampling memory every 0.1s
> running new process
> 0 time: 0.0 rss: 135.4
> 1 time: 4.9 rss: 1252.2
> 2 time: 7.1 rss: 1265.0
> 3 time: 7.5 rss: 760.2
> 4 time: 10.7 rss: 758.9
> 5 time: 19.6 rss: 16745.4
> 6 time: 22.6 rss: 16335.4
> 7 time: 22.9 rss: 15833.0
> 8 time: 25.9 rss: 955.0
> {code}
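> The attached graph can then be rendered with memory-profiler's plot command
> (presumably how memory-profiler.png was produced):
> {code:bash}
> mprof plot -o memory-profiler.png
> {code}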