[
https://issues.apache.org/jira/browse/ARROW-18156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624296#comment-17624296
]
Joris Van den Bossche commented on ARROW-18156:
-----------------------------------------------
I also cannot reproduce this on pyarrow 4.0:
{code}
In [8]: main()
jemalloc
Runs: 5
After run 0: RSS = 6.83 GB, PyArrow Allocated Bytes = 5.94 GB
After run 1: RSS = 8.08 GB, PyArrow Allocated Bytes = 5.94 GB
After run 2: RSS = 8.09 GB, PyArrow Allocated Bytes = 5.94 GB
After run 3: RSS = 8.09 GB, PyArrow Allocated Bytes = 5.94 GB
After run 4: RSS = 8.10 GB, PyArrow Allocated Bytes = 5.94 GB
In [9]: pa.set_memory_pool(pa.system_memory_pool())
In [10]: main()
system
Runs: 5
After run 0: RSS = 6.81 GB, PyArrow Allocated Bytes = 1.31 GB
After run 1: RSS = 8.10 GB, PyArrow Allocated Bytes = 1.31 GB
After run 2: RSS = 7.88 GB, PyArrow Allocated Bytes = 1.31 GB
After run 3: RSS = 8.09 GB, PyArrow Allocated Bytes = 1.31 GB
After run 4: RSS = 8.09 GB, PyArrow Allocated Bytes = 1.31 GB
{code}
I also checked with the system allocator (since you were using that), and it
doesn't reproduce the issue either. But I do notice a curious difference in the
PyArrow allocated bytes, which might also be related to the difference you see
between the Datasets API and the ParquetFile API.
It seems that with the system memory pool, we don't track the data that is
created by the dataset scanner, only the additional data created in the
{{to_pandas}} call (if I remove {{to_pandas}} from the example, the system pool
reports 0.0 GB allocated). [~westonpace] that seems like a bug?
[~Norbo11] as Weston mentioned, I would indeed try again without the
{{to_pandas}} call (that is the more expensive part here), to see whether that
makes a difference.
> [Python/C++] High memory usage/potential leak when reading parquet using
> Dataset API
> ------------------------------------------------------------------------------------
>
> Key: ARROW-18156
> URL: https://issues.apache.org/jira/browse/ARROW-18156
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet
> Affects Versions: 4.0.1
> Reporter: Norbert
> Priority: Major
>
> Hi,
> I have a 2.35 GB DataFrame (1.17 GB on-disk size) which I'm loading using the
> following snippet:
>
> {code:python}
> import os
>
> import pyarrow
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> from importlib_metadata import version
> from psutil import Process
>
>
> def format_bytes(num_bytes: int):
>     return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"
>
>
> def main():
>     print(version("pyarrow"))
>     print(pyarrow.default_memory_pool().backend_name)
>     process = Process(os.getpid())
>     runs = 10
>     print(f"Runs: {runs}")
>     for i in range(runs):
>         dataset = ds.dataset("df.pq")
>         table = dataset.to_table()
>         df = table.to_pandas()
>         print(
>             f"After run {i}: RSS = {format_bytes(process.memory_info().rss)}, "
>             f"PyArrow Allocated Bytes = {format_bytes(pyarrow.total_allocated_bytes())}"
>         )
> {code}
>
>
> On PyArrow v4.0.1 the output is as follows:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 7.59 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 1: RSS = 13.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 2: RSS = 14.74 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 3: RSS = 15.78 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 4: RSS = 18.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 5: RSS = 19.69 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 6: RSS = 21.21 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 7: RSS = 21.52 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 8: RSS = 21.49 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 9: RSS = 21.72 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 10: RSS = 20.95 GB, PyArrow Allocated Bytes = 6.09 GB{code}
> If I replace {{ds.dataset("df.pq").to_table()}} with
> {{pq.ParquetFile("df.pq").read()}}, the output is:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 2.38 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 1: RSS = 2.49 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 2: RSS = 2.50 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 3: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 4: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 5: RSS = 2.56 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 6: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 7: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 8: RSS = 2.48 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 9: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 10: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB{code}
> The memory usage of the older non-dataset API is much lower - it matches the
> size of the DataFrame much more closely. It also seems that the former example
> has a memory leak? I initially thought the increase in RSS was just due to
> PyArrow's use of jemalloc, but I appear to be using the system allocator here.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)