[
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597904#comment-17597904
]
Gianluca Ficarelli commented on ARROW-17399:
--------------------------------------------
I tested previous versions of pyarrow on Linux; here are the results:
* pyarrow==9.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 90.6
1 time: 2.8 rss: 1205.4
2 time: 4.4 rss: 1212.3
3 time: 4.7 rss: 709.7
4 time: 7.8 rss: 707.9
5 time: 14.4 rss: 16656.0
6 time: 17.4 rss: 16246.0
7 time: 17.5 rss: 15743.6
8 time: 20.5 rss: 866.3
{code}
* pyarrow==8.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==8.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 86.2
1 time: 2.8 rss: 1200.9
2 time: 4.3 rss: 2266.2
3 time: 4.6 rss: 1443.6
4 time: 7.7 rss: 703.5
5 time: 14.3 rss: 16648.0
6 time: 17.3 rss: 16238.0
7 time: 17.4 rss: 15738.6
8 time: 20.4 rss: 861.3
{code}
* pyarrow==7.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==7.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 84.3
1 time: 2.8 rss: 1199.1
2 time: 4.4 rss: 2263.7
3 time: 4.6 rss: 1441.8
4 time: 7.7 rss: 701.6
5 time: 9.8 rss: 3679.9
6 time: 12.8 rss: 3268.3
7 time: 12.9 rss: 2766.6
8 time: 15.9 rss: 859.2
{code}
* pyarrow==6.0.1
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.1
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 81.9
1 time: 2.9 rss: 1196.8
2 time: 4.5 rss: 2261.4
3 time: 4.7 rss: 1439.0
4 time: 7.8 rss: 698.9
5 time: 9.2 rss: 2224.0
6 time: 12.2 rss: 1740.4
7 time: 12.3 rss: 1238.1
8 time: 15.3 rss: 856.6
{code}
* pyarrow==6.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 81.7
1 time: 2.9 rss: 1196.6
2 time: 4.5 rss: 2261.1
3 time: 4.7 rss: 1438.5
4 time: 7.8 rss: 698.4
5 time: 9.2 rss: 2224.9
6 time: 12.2 rss: 1740.1
7 time: 12.3 rss: 1237.7
8 time: 15.3 rss: 856.2
{code}
* pyarrow==5.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==5.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 79.2
1 time: 2.8 rss: 1194.0
2 time: 4.3 rss: 2258.3
3 time: 4.5 rss: 1436.2
4 time: 7.7 rss: 696.1
5 time: 9.1 rss: 2221.1
6 time: 12.1 rss: 1736.3
7 time: 12.2 rss: 1235.0
8 time: 15.3 rss: 853.5
{code}
In summary:
* with pyarrow 9.0.0 and 8.0.0 the results are similar (peak RSS ≈ 16.3 GiB after loading)
* with pyarrow 7.0.0 the memory used is noticeably lower (peak RSS ≈ 3.6 GiB)
* with pyarrow 6.0.1, 6.0.0 and 5.0.0 the memory used is even lower (peak RSS ≈ 2.2 GiB)
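A possible way to narrow down where the retained memory lives is to compare the process RSS with what pyarrow's memory pool reports as allocated. A minimal sketch, assuming the {{example.parquet}} file produced by the attached script and the standard memory-pool introspection API ({{default_memory_pool}}, {{bytes_allocated}}, {{backend_name}}):
{code:python}
import psutil
import pyarrow as pa
import pyarrow.parquet as pq


def report(label, process=psutil.Process()):
    # Compare process RSS with the bytes the Arrow memory pool reports as live.
    pool = pa.default_memory_pool()
    rss = process.memory_info().rss / 2 ** 20
    allocated = pool.bytes_allocated() / 2 ** 20
    print(f"{label:<12} rss: {rss:>10.1f} MiB  pool ({pool.backend_name}): {allocated:>10.1f} MiB")


if __name__ == "__main__":
    # Optional: with the jemalloc backend, ask the allocator to return freed
    # pages to the OS immediately instead of caching them.
    # pa.jemalloc_set_decay_ms(0)

    report("start")
    table = pq.read_table("example.parquet")  # file written by the attached script
    report("read_table")
    df = table.to_pandas()
    report("to_pandas")
    del df, table
    report("deleted")
{code}
If the pool reports far less than the RSS growth after {{to_pandas}}, the extra memory is presumably held outside live Arrow buffers (e.g. allocator caches or the converted pandas objects); with the jemalloc backend, {{jemalloc_set_decay_ms(0)}} can be used to check whether page caching accounts for part of it.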
> pyarrow may use a lot of memory to load a dataframe from parquet
> ----------------------------------------------------------------
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 9.0.0
> Environment: linux
> Reporter: Gianluca Ficarelli
> Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using
> {{pyarrow.parquet.read_table}}, memory usage may grow far beyond what is
> needed to hold the dataframe, and it is not freed until the dataframe is
> deleted.
> The problem is evident when the dataframe has a *column containing lists or
> numpy arrays*, while it seems absent (or not noticeable) if the column
> contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created
> with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but
> the memory usage after loading it from parquet is around 16 GB.
> The column items are created as numpy arrays rather than lists, to be
> consistent with the types loaded back from parquet (pyarrow produces numpy
> arrays, not lists).
>
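> For instance (a minimal illustration of this behavior), a {{list<int64>}} column read back through {{to_pandas()}} yields numpy arrays as the cell values:
> {code:python}
> import pyarrow as pa
> 
> table = pa.table({"a": [[1, 2, 3], [4, 5]]})  # list<int64> column
> cell = table.to_pandas()["a"].iloc[0]
> print(type(cell))  # <class 'numpy.ndarray'>
> {code}
> The script to reproduce the issue: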
> {code:python}
> import gc
> import time
> 
> import numpy as np
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> import psutil
> 
> 
> def pyarrow_dump(filename, df, compression="snappy"):
>     table = pyarrow.Table.from_pandas(df)
>     pyarrow.parquet.write_table(table, filename, compression=compression)
> 
> 
> def pyarrow_load(filename):
>     table = pyarrow.parquet.read_table(filename)
>     return table.to_pandas()
> 
> 
> def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
>     # gc.collect()
>     current_time = time.monotonic() - start_time
>     rss = process.memory_info().rss / 2 ** 20
>     print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
> 
> 
> if __name__ == "__main__":
>     print_mem(0)
>     rows = 5000000
>     df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
>     print_mem(1)
> 
>     pyarrow_dump("example.parquet", df)
>     print_mem(2)
> 
>     del df
>     print_mem(3)
>     time.sleep(3)
>     print_mem(4)
> 
>     df = pyarrow_load("example.parquet")
>     print_mem(5)
>     time.sleep(3)
>     print_mem(6)
> 
>     del df
>     print_mem(7)
>     time.sleep(3)
>     print_mem(8)
> {code}
> Run with memory-profiler:
> {code:bash}
> mprof run --multiprocess python test_pyarrow.py
> {code}
> Output:
> {code:java}
> mprof: Sampling memory every 0.1s
> running new process
> 0 time: 0.0 rss: 135.4
> 1 time: 4.9 rss: 1252.2
> 2 time: 7.1 rss: 1265.0
> 3 time: 7.5 rss: 760.2
> 4 time: 10.7 rss: 758.9
> 5 time: 19.6 rss: 16745.4
> 6 time: 22.6 rss: 16335.4
> 7 time: 22.9 rss: 15833.0
> 8 time: 25.9 rss: 955.0
> {code}