[ 
https://issues.apache.org/jira/browse/ARROW-18156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624434#comment-17624434
 ] 

Norbert commented on ARROW-18156:
---------------------------------

With the following operation (no to_pandas() call):

 
{code:java}
table = ds.dataset("new_df.pq").to_table(){code}
 

The output is as follows:

 
{code:java}
4.0.1
system
Runs: 10
After run 0: RSS = 5.55 GB, PyArrow Allocated Bytes = 4.63 GB
After run 1: RSS = 10.94 GB, PyArrow Allocated Bytes = 4.63 GB
After run 2: RSS = 12.55 GB, PyArrow Allocated Bytes = 4.63 GB
After run 3: RSS = 13.89 GB, PyArrow Allocated Bytes = 4.63 GB
After run 4: RSS = 14.57 GB, PyArrow Allocated Bytes = 4.63 GB
After run 5: RSS = 15.72 GB, PyArrow Allocated Bytes = 4.63 GB
After run 6: RSS = 16.62 GB, PyArrow Allocated Bytes = 4.63 GB
After run 7: RSS = 16.75 GB, PyArrow Allocated Bytes = 4.63 GB
After run 8: RSS = 17.41 GB, PyArrow Allocated Bytes = 4.63 GB
After run 9: RSS = 17.55 GB, PyArrow Allocated Bytes = 4.63 GB{code}
 

With the following operation (no to_pandas() call):

 
{code:java}
table = pq.ParquetFile("new_df.pq").read(){code}
 

The output is as follows:

 
{code:java}
4.0.1
system
Runs: 10
After run 0: RSS = 4.79 GB, PyArrow Allocated Bytes = 4.63 GB
After run 1: RSS = 4.80 GB, PyArrow Allocated Bytes = 4.63 GB
After run 2: RSS = 4.81 GB, PyArrow Allocated Bytes = 4.63 GB
After run 3: RSS = 4.87 GB, PyArrow Allocated Bytes = 4.63 GB
After run 4: RSS = 4.91 GB, PyArrow Allocated Bytes = 4.63 GB
After run 5: RSS = 4.95 GB, PyArrow Allocated Bytes = 4.63 GB
After run 6: RSS = 5.01 GB, PyArrow Allocated Bytes = 4.63 GB
After run 7: RSS = 4.97 GB, PyArrow Allocated Bytes = 4.63 GB
After run 8: RSS = 4.97 GB, PyArrow Allocated Bytes = 4.63 GB
After run 9: RSS = 4.97 GB, PyArrow Allocated Bytes = 4.63 GB{code}
 

table.nbytes is 4971300000 (~4.63 GB) in both cases.

The issue is still present: the memory profile of
ds.dataset("new_df.pq").to_table() is much higher than that of
pq.ParquetFile("new_df.pq").read().

One very interesting thing is that in the pq.ParquetFile() case, removing the 
to_pandas() call has actually led to *higher* memory usage at the end of the 
call. Compare:

pq.ParquetFile("new_df.pq").read()

 
{code:java}
After run 0: RSS = 4.91 GB, PyArrow Allocated Bytes = 4.75 GB{code}
 

pq.ParquetFile("new_df.pq").read().to_pandas()

 
{code:java}
After run 0: RSS = 2.37 GB, PyArrow Allocated Bytes = 1.34 GB{code}
I'm guessing this is because when you chain the to_pandas() call on the same
line, the refcount of the intermediate pyarrow.Table object drops to 0 and it
gets garbage collected immediately - and somehow the pyarrow.Table
representation of the data consumes about 2x more memory than the pd.DataFrame
representation.
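The refcount effect is easy to demonstrate in pure Python (BigTable here is a hypothetical stand-in for pyarrow.Table; CPython's refcounting collects an unreferenced temporary as soon as the chained call returns):

```python
import weakref

class BigTable:
    """Hypothetical stand-in for a pyarrow.Table holding a large buffer."""
    def to_pandas(self):
        return "dataframe"

def read():
    return BigTable()

ref = None
def make_ref(obj):
    # Attach a weak reference so we can observe when the object dies.
    global ref
    ref = weakref.ref(obj)
    return obj

# Chained call: the table is a temporary with no name bound to it, so it
# is collected the moment to_pandas() returns.
df = make_ref(read()).to_pandas()
assert ref() is None  # temporary already gone

# Binding the table to a variable keeps its memory alive alongside df.
table = make_ref(read())
df2 = table.to_pandas()
assert ref() is not None  # table still alive
```

Note this immediate collection is a CPython refcounting behavior; other interpreters may defer it to a GC cycle.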

 

[~jorisvandenbossche] regarding what you've discovered with byte tracking - as 
you can see above, for me removing to_pandas does not reduce the reported bytes 
allocated by the pool down to 0.0 GB, it stays at 4.63 GB. Your original reply 
seems to indicate that it's a tracking problem, but the bug report you filed 
claims that the Dataset scanner always uses the jemalloc allocator. However, if 
I do pyarrow.jemalloc_memory_pool() I get the following output:
{code:java}
ArrowNotImplementedError: This Arrow build does not enable jemalloc{code}
So in my case, is it actually using the system allocator and failing to track 
bytes properly, or doing something completely different?

What else could affect behavior here? Python version? How the pyarrow package 
was installed?

 

> [Python/C++] High memory usage/potential leak when reading parquet using 
> Dataset API
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-18156
>                 URL: https://issues.apache.org/jira/browse/ARROW-18156
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet
>    Affects Versions: 4.0.1
>            Reporter: Norbert
>            Priority: Major
>
> Hi,
> I have a 2.35 GB DataFrame (1.17 GB on-disk size) which I'm loading using the 
> following snippet:
>  
> {code:java}
> import os
> import pyarrow
> import pyarrow.dataset as ds
> from importlib_metadata import version
> from psutil import Process
> import pyarrow.parquet as pq
> def format_bytes(num_bytes: int):
>     return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"
>  
> def main():
>     print(version("pyarrow"))
>     print(pyarrow.default_memory_pool().backend_name)
>     process = Process(os.getpid())
>     runs = 10
>     print(f"Runs: {runs}")
>     for i in range(runs):
>         dataset = ds.dataset("df.pq")
>         table = dataset.to_table()
>         df = table.to_pandas()
>         print(f"After run {i}: RSS = 
> {format_bytes(process.memory_info().rss)}, PyArrow Allocated Bytes = 
> {format_bytes(pyarrow.total_allocated_bytes())}")
> {code}
>  
>  
> On PyArrow v4.0.1 the output is as follows:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 7.59 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 1: RSS = 13.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 2: RSS = 14.74 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 3: RSS = 15.78 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 4: RSS = 18.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 5: RSS = 19.69 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 6: RSS = 21.21 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 7: RSS = 21.52 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 8: RSS = 21.49 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 9: RSS = 21.72 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 10: RSS = 20.95 GB, PyArrow Allocated Bytes = 6.09 GB{code}
> If I replace ds.dataset("df.pq").to_table() with 
> pq.ParquetFile("df.pq").read(), the output is:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 2.38 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 1: RSS = 2.49 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 2: RSS = 2.50 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 3: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 4: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 5: RSS = 2.56 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 6: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 7: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 8: RSS = 2.48 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 9: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 10: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB{code}
> The usage profile of the older non-dataset API is much lower - it tracks the
> size of the dataframe much more closely. It also seems like the former
> example has a memory leak? I thought the increase in RSS was just due to
> PyArrow's usage of jemalloc, but I seem to be using the system allocator here.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
