Gianluca Ficarelli created ARROW-17399:
------------------------------------------
Summary: pyarrow may use a lot of memory to load a dataframe from
parquet
Key: ARROW-17399
URL: https://issues.apache.org/jira/browse/ARROW-17399
Project: Apache Arrow
Issue Type: Bug
Components: Parquet, Python
Affects Versions: 9.0.0
Environment: linux
Reporter: Gianluca Ficarelli
Attachments: memory-profiler.png
When a pandas dataframe is loaded from a parquet file using
{{{}pyarrow.parquet.read_table{}}}, the memory usage may grow a lot more than
what should be needed to load the dataframe, and it's not freed until the
dataframe is deleted.
The problem is evident when the dataframe has a {*}column containing lists or
numpy arrays{*}, while it seems absent (or not noticeable) if the column
contains only integer or floats.
I'm attaching a simple script to reproduce the issue, and a graph created with
memory-profiler showing the memory usage.
In this example, the dataframe created with pandas needs around 1.2 GB, but the
memory usage after loading it from parquet is around 16 GB.
The items of the column are created as numpy arrays and not lists, to be
consistent with the types loaded from parquet (pyarrow produces numoy arrays
and not lists).
{code:python}
import gc
import time
import numpy as np
import pandas as pd
import pyarrow
import pyarrow.parquet
import psutil
def pyarrow_dump(filename, df, compression="snappy"):
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table, filename, compression=compression)
def pyarrow_load(filename):
table = pyarrow.parquet.read_table(filename)
return table.to_pandas()
def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
# gc.collect()
current_time = time.monotonic() - start_time
rss = process.memory_info().rss / 2 ** 20
print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
if __name__ == "__main__":
print_mem(0)
rows = 5000000
df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
print_mem(1)
pyarrow_dump("example.parquet", df)
print_mem(2)
del df
print_mem(3)
time.sleep(3)
print_mem(4)
df = pyarrow_load("example.parquet")
print_mem(5)
time.sleep(3)
print_mem(6)
del df
print_mem(7)
time.sleep(3)
print_mem(8)
{code}
Run with memory-profiler:
{code:bash}
mprof run --multiprocess python test_pyarrow.py
{code}
Output:
{code:java}
mprof: Sampling memory every 0.1s
running new process
0 time: 0.0 rss: 135.4
1 time: 4.9 rss: 1252.2
2 time: 7.1 rss: 1265.0
3 time: 7.5 rss: 760.2
4 time: 10.7 rss: 758.9
5 time: 19.6 rss: 16745.4
6 time: 22.6 rss: 16335.4
7 time: 22.9 rss: 15833.0
8 time: 25.9 rss: 955.0
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)