[ 
https://issues.apache.org/jira/browse/ARROW-18156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624434#comment-17624434
 ] 

Norbert edited comment on ARROW-18156 at 10/26/22 12:52 PM:
------------------------------------------------------------

With the following operation (no to_pandas() call):
{code:java}
table = ds.dataset("new_df.pq").to_table(){code}
The output is as follows:
{code:java}
4.0.1
system
Runs: 10
After run 0: RSS = 5.55 GB, PyArrow Allocated Bytes = 4.63 GB
After run 1: RSS = 10.94 GB, PyArrow Allocated Bytes = 4.63 GB
After run 2: RSS = 12.55 GB, PyArrow Allocated Bytes = 4.63 GB
After run 3: RSS = 13.89 GB, PyArrow Allocated Bytes = 4.63 GB
After run 4: RSS = 14.57 GB, PyArrow Allocated Bytes = 4.63 GB
After run 5: RSS = 15.72 GB, PyArrow Allocated Bytes = 4.63 GB
After run 6: RSS = 16.62 GB, PyArrow Allocated Bytes = 4.63 GB
After run 7: RSS = 16.75 GB, PyArrow Allocated Bytes = 4.63 GB
After run 8: RSS = 17.41 GB, PyArrow Allocated Bytes = 4.63 GB
After run 9: RSS = 17.55 GB, PyArrow Allocated Bytes = 4.63 GB{code}
With the following operation (no to_pandas() call):
{code:java}
table = pq.ParquetFile("new_df.pq").read(){code}
The output is as follows:
{code:java}
4.0.1
system
Runs: 10
After run 0: RSS = 4.79 GB, PyArrow Allocated Bytes = 4.63 GB
After run 1: RSS = 4.80 GB, PyArrow Allocated Bytes = 4.63 GB
After run 2: RSS = 4.81 GB, PyArrow Allocated Bytes = 4.63 GB
After run 3: RSS = 4.87 GB, PyArrow Allocated Bytes = 4.63 GB
After run 4: RSS = 4.91 GB, PyArrow Allocated Bytes = 4.63 GB
After run 5: RSS = 4.95 GB, PyArrow Allocated Bytes = 4.63 GB
After run 6: RSS = 5.01 GB, PyArrow Allocated Bytes = 4.63 GB
After run 7: RSS = 4.97 GB, PyArrow Allocated Bytes = 4.63 GB
After run 8: RSS = 4.97 GB, PyArrow Allocated Bytes = 4.63 GB
After run 9: RSS = 4.97 GB, PyArrow Allocated Bytes = 4.63 GB{code}
table.nbytes is 4971300000 (≈4.63 GB) in both cases.

The issue is still present: the memory usage of 
ds.dataset("new_df.pq").to_table() is much higher than that of 
pq.ParquetFile("new_df.pq").read().
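One way to tell allocator fragmentation apart from a genuine leak is to ask the system allocator to return its free pages to the OS between runs and see whether RSS drops. A minimal sketch, assuming Linux/glibc (the helper name is my own, and malloc_trim(3) via ctypes is glibc-specific; the helper is a no-op on other platforms):

```python
import ctypes
import ctypes.util

def try_malloc_trim() -> bool:
    """Ask glibc to return free heap pages to the OS; True if the call ran."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"))
        libc.malloc_trim(0)  # glibc-only: releases free arena memory back to the OS
        return True
    except (OSError, AttributeError, TypeError):
        return False  # non-glibc platform, or symbol not available
```

If RSS drops substantially after calling this between runs, the growth was free memory retained by the system allocator rather than memory still owned by Arrow.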

One very interesting thing is that in the pq.ParquetFile() case, removing the 
to_pandas() call has actually led to *higher* memory usage at the end of the 
call. Compare:

pq.ParquetFile("new_df.pq").read()
{code:java}
After run 0: RSS = 4.91 GB, PyArrow Allocated Bytes = 4.75 GB{code}
pq.ParquetFile("new_df.pq").read().to_pandas()
{code:java}
After run 0: RSS = 2.31 GB, PyArrow Allocated Bytes = 1.37 GB{code}
I'm guessing this is because when to_pandas() is chained on the same line, 
the refcount of the intermediate pyarrow.Table drops to zero as soon as the 
conversion returns, so it gets garbage collected and only the pd.DataFrame 
stays in memory. Somehow, though, the pyarrow.Table representation of the data 
consumes 2x more memory than the pd.DataFrame representation, which also seems 
odd.
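The lifetime difference between the chained and unchained forms can be demonstrated with plain Python objects standing in for the real types (the Table class and read() helper here are hypothetical illustrations; the behavior relies on CPython's immediate refcount-based collection):

```python
import weakref

class Table:
    """Stand-in for the large intermediate pyarrow.Table (hypothetical)."""
    def to_pandas(self):
        return "DataFrame"  # stand-in for the converted result

refs = []

def read():
    table = Table()
    refs.append(weakref.ref(table))  # observe when the Table is freed
    return table

# Chained: the Table is a temporary; once to_pandas() returns, its
# refcount drops to zero and CPython frees it immediately.
df = read().to_pandas()
print(refs[0]() is None)   # True: intermediate already collected

# Unchained: the name `table` keeps the Table alive alongside df.
table = read()
df = table.to_pandas()
print(refs[1]() is None)   # False: Table still referenced
```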

 

[~jorisvandenbossche] regarding what you've discovered with byte tracking - as 
you can see above, for me removing to_pandas does not reduce the reported bytes 
allocated by the pool down to 0.0 GB, it stays at 4.63 GB. Your original reply 
seems to indicate that it's a tracking problem, but the bug report you filed 
claims that the Dataset scanner always uses the jemalloc allocator. However, if 
I do pyarrow.jemalloc_memory_pool() I get the following output:
{code:java}
ArrowNotImplementedError: This Arrow build does not enable jemalloc{code}
So in my case, is it actually using the system allocator and failing to track 
bytes properly, or doing something completely different?
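For reference, which allocator backends a given pyarrow build actually supports can be probed directly. A hedged sketch (the probe_backends helper is my own; the pool factory functions such as pyarrow.system_memory_pool and pyarrow.jemalloc_memory_pool are real APIs that raise ArrowNotImplementedError when the backend is not compiled in):

```python
def probe_backends() -> dict:
    """Report which Arrow allocator backends this build exposes."""
    try:
        import pyarrow as pa
    except ImportError:
        return {}  # pyarrow not installed; nothing to probe
    report = {}
    for name in ("system", "jemalloc", "mimalloc"):
        factory = getattr(pa, f"{name}_memory_pool", None)
        if factory is None:
            report[name] = "not in this pyarrow version"
            continue
        try:
            factory()  # raises ArrowNotImplementedError if not compiled in
            report[name] = "available"
        except Exception as exc:
            report[name] = f"unavailable ({type(exc).__name__})"
    return report

print(probe_backends())
```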

What else could affect behavior here? Python version? How the pyarrow package 
was installed? I'm using Python 3.8.3.

[~jorisvandenbossche] could you also post the output of your 
pq.ParquetFile().read run? The snippet you posted still shows 8 GB usage, which 
is lower than what I reported, but still almost 4x the size of the 2.12 GB 
DataFrame and nearly 2x the size of the 4.9 GB Table.



> [Python/C++] High memory usage/potential leak when reading parquet using 
> Dataset API
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-18156
>                 URL: https://issues.apache.org/jira/browse/ARROW-18156
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet
>    Affects Versions: 4.0.1
>            Reporter: Norbert
>            Priority: Major
>
> Hi,
> I have a 2.35 GB DataFrame (1.17 GB on-disk size) which I'm loading using the 
> following snippet:
>  
> {code:java}
> import os
> 
> import pyarrow
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> from importlib_metadata import version
> from psutil import Process
> 
> 
> def format_bytes(num_bytes: int):
>     return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"
> 
> 
> def main():
>     print(version("pyarrow"))
>     print(pyarrow.default_memory_pool().backend_name)
>     process = Process(os.getpid())
>     runs = 10
>     print(f"Runs: {runs}")
>     for i in range(runs):
>         dataset = ds.dataset("df.pq")
>         table = dataset.to_table()
>         df = table.to_pandas()
>         print(f"After run {i}: RSS = {format_bytes(process.memory_info().rss)}, "
>               f"PyArrow Allocated Bytes = {format_bytes(pyarrow.total_allocated_bytes())}")
> 
> 
> if __name__ == "__main__":
>     main()
> {code}
>  
>  
> On PyArrow v4.0.1 the output is as follows:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 7.59 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 1: RSS = 13.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 2: RSS = 14.74 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 3: RSS = 15.78 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 4: RSS = 18.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 5: RSS = 19.69 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 6: RSS = 21.21 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 7: RSS = 21.52 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 8: RSS = 21.49 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 9: RSS = 21.72 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 10: RSS = 20.95 GB, PyArrow Allocated Bytes = 6.09 GB{code}
> If I replace ds.dataset("df.pq").to_table() with 
> pq.ParquetFile("df.pq").read(), the output is:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 2.38 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 1: RSS = 2.49 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 2: RSS = 2.50 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 3: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 4: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 5: RSS = 2.56 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 6: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 7: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 8: RSS = 2.48 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 9: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 10: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB{code}
> The usage profile of the older non-dataset API is much lower - it matches the 
> size of the dataframe much closer. It also seems like in the former example, 
> there is a memory leak? I thought that the increase in RSS was just due to 
> PyArrow's usage of jemalloc, but I seem to be using the system allocator here.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
