[
https://issues.apache.org/jira/browse/ARROW-17399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597904#comment-17597904
]
Gianluca Ficarelli commented on ARROW-17399:
--------------------------------------------
I tested previous versions of pyarrow on Linux; here are the results:
* pyarrow==9.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==9.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 90.6
1 time: 2.8 rss: 1205.4
2 time: 4.4 rss: 1212.3
3 time: 4.7 rss: 709.7
4 time: 7.8 rss: 707.9
5 time: 14.4 rss: 16656.0
6 time: 17.4 rss: 16246.0
7 time: 17.5 rss: 15743.6
8 time: 20.5 rss: 866.3
{code}
* pyarrow==8.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==8.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 86.2
1 time: 2.8 rss: 1200.9
2 time: 4.3 rss: 2266.2
3 time: 4.6 rss: 1443.6
4 time: 7.7 rss: 703.5
5 time: 14.3 rss: 16648.0
6 time: 17.3 rss: 16238.0
7 time: 17.4 rss: 15738.6
8 time: 20.4 rss: 861.3
{code}
* pyarrow==7.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==7.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 84.3
1 time: 2.8 rss: 1199.1
2 time: 4.4 rss: 2263.7
3 time: 4.6 rss: 1441.8
4 time: 7.7 rss: 701.6
5 time: 9.8 rss: 3679.9
6 time: 12.8 rss: 3268.3
7 time: 12.9 rss: 2766.6
8 time: 15.9 rss: 859.2
{code}
* pyarrow==6.0.1
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.1
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 81.9
1 time: 2.9 rss: 1196.8
2 time: 4.5 rss: 2261.4
3 time: 4.7 rss: 1439.0
4 time: 7.8 rss: 698.9
5 time: 9.2 rss: 2224.0
6 time: 12.2 rss: 1740.4
7 time: 12.3 rss: 1238.1
8 time: 15.3 rss: 856.6
{code}
* pyarrow==6.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==6.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 81.7
1 time: 2.9 rss: 1196.6
2 time: 4.5 rss: 2261.1
3 time: 4.7 rss: 1438.5
4 time: 7.8 rss: 698.4
5 time: 9.2 rss: 2224.9
6 time: 12.2 rss: 1740.1
7 time: 12.3 rss: 1237.7
8 time: 15.3 rss: 856.2
{code}
* pyarrow==5.0.0
{code:java}
pip freeze
numpy==1.23.2
pandas==1.4.3
psutil==5.9.1
pyarrow==5.0.0
python-dateutil==2.8.2
pytz==2022.2.1
six==1.16.0
python test_pyarrow_orig.py
0 time: 0.0 rss: 79.2
1 time: 2.8 rss: 1194.0
2 time: 4.3 rss: 2258.3
3 time: 4.5 rss: 1436.2
4 time: 7.7 rss: 696.1
5 time: 9.1 rss: 2221.1
6 time: 12.1 rss: 1736.3
7 time: 12.2 rss: 1235.0
8 time: 15.3 rss: 853.5
{code}
In summary:
* with pyarrow 9.0.0 and 8.0.0 the results are similar (peak RSS ≈ 16.3 GiB after loading)
* with pyarrow 7.0.0 the memory used is noticeably lower (peak RSS ≈ 3.6 GiB)
* with pyarrow 6.0.1, 6.0.0 and 5.0.0 the memory used is even lower (peak RSS ≈ 2.2 GiB)
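A possible way to narrow down where the retained memory lives is to compare the process RSS with what pyarrow's memory pool reports as allocated. A minimal sketch, assuming the {{example.parquet}} file produced by the attached script and the standard memory-pool introspection API ({{default_memory_pool}}, {{bytes_allocated}}, {{backend_name}}):
{code:python}
import psutil
import pyarrow as pa
import pyarrow.parquet as pq


def report(label, process=psutil.Process()):
    # Compare process RSS with the bytes the Arrow memory pool reports as live.
    pool = pa.default_memory_pool()
    rss = process.memory_info().rss / 2 ** 20
    allocated = pool.bytes_allocated() / 2 ** 20
    print(f"{label:<12} rss: {rss:>10.1f} MiB  pool ({pool.backend_name}): {allocated:>10.1f} MiB")


if __name__ == "__main__":
    # Optional: with the jemalloc backend, ask the allocator to return freed
    # pages to the OS immediately instead of caching them.
    # pa.jemalloc_set_decay_ms(0)

    report("start")
    table = pq.read_table("example.parquet")  # file written by the attached script
    report("read_table")
    df = table.to_pandas()
    report("to_pandas")
    del df, table
    report("deleted")
{code}
If the pool reports far less than the RSS growth after {{to_pandas}}, the extra memory is presumably held outside live Arrow buffers (e.g. allocator caches or the converted pandas objects); with the jemalloc backend, {{jemalloc_set_decay_ms(0)}} can be used to check whether page caching accounts for part of it.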
> pyarrow may use a lot of memory to load a dataframe from parquet
> ----------------------------------------------------------------
>
> Key: ARROW-17399
> URL: https://issues.apache.org/jira/browse/ARROW-17399
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 9.0.0
> Environment: linux
> Reporter: Gianluca Ficarelli
> Priority: Major
> Attachments: memory-profiler.png
>
>
> When a pandas dataframe is loaded from a parquet file using
> {{pyarrow.parquet.read_table}}, memory usage may grow far beyond what is
> needed to hold the dataframe, and it is not freed until the dataframe is
> deleted.
> The problem is evident when the dataframe has a *column containing lists or
> numpy arrays*, while it seems absent (or not noticeable) if the column
> contains only integers or floats.
> I'm attaching a simple script to reproduce the issue, and a graph created
> with memory-profiler showing the memory usage.
> In this example, the dataframe created with pandas needs around 1.2 GB, but
> the memory usage after loading it from parquet is around 16 GB.
> The column items are created as numpy arrays rather than lists, to be
> consistent with the types loaded back from parquet (pyarrow produces numpy
> arrays, not lists).
>
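> For instance (a minimal illustration of this behavior), a {{list<int64>}} column read back through {{to_pandas()}} yields numpy arrays as the cell values:
> {code:python}
> import pyarrow as pa
> 
> table = pa.table({"a": [[1, 2, 3], [4, 5]]})  # list<int64> column
> cell = table.to_pandas()["a"].iloc[0]
> print(type(cell))  # <class 'numpy.ndarray'>
> {code}
> The script to reproduce the issue: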
> {code:python}
> import gc
> import time
> 
> import numpy as np
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> import psutil
> 
> 
> def pyarrow_dump(filename, df, compression="snappy"):
>     table = pyarrow.Table.from_pandas(df)
>     pyarrow.parquet.write_table(table, filename, compression=compression)
> 
> 
> def pyarrow_load(filename):
>     table = pyarrow.parquet.read_table(filename)
>     return table.to_pandas()
> 
> 
> def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
>     # gc.collect()
>     current_time = time.monotonic() - start_time
>     rss = process.memory_info().rss / 2 ** 20
>     print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")
> 
> 
> if __name__ == "__main__":
>     print_mem(0)
>     rows = 5000000
>     df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
>     print_mem(1)
> 
>     pyarrow_dump("example.parquet", df)
>     print_mem(2)
> 
>     del df
>     print_mem(3)
>     time.sleep(3)
>     print_mem(4)
> 
>     df = pyarrow_load("example.parquet")
>     print_mem(5)
>     time.sleep(3)
>     print_mem(6)
> 
>     del df
>     print_mem(7)
>     time.sleep(3)
>     print_mem(8)
> {code}
> Run with memory-profiler:
> {code:bash}
> mprof run --multiprocess python test_pyarrow.py
> {code}
> Output:
> {code:java}
> mprof: Sampling memory every 0.1s
> running new process
> 0 time: 0.0 rss: 135.4
> 1 time: 4.9 rss: 1252.2
> 2 time: 7.1 rss: 1265.0
> 3 time: 7.5 rss: 760.2
> 4 time: 10.7 rss: 758.9
> 5 time: 19.6 rss: 16745.4
> 6 time: 22.6 rss: 16335.4
> 7 time: 22.9 rss: 15833.0
> 8 time: 25.9 rss: 955.0
> {code}