[
https://issues.apache.org/jira/browse/ARROW-18156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624296#comment-17624296
]
Joris Van den Bossche commented on ARROW-18156:
-----------------------------------------------
I also cannot reproduce this on pyarrow 4.0:
{code}
In [8]: main()
jemalloc
Runs: 5
After run 0: RSS = 6.83 GB, PyArrow Allocated Bytes = 5.94 GB
After run 1: RSS = 8.08 GB, PyArrow Allocated Bytes = 5.94 GB
After run 2: RSS = 8.09 GB, PyArrow Allocated Bytes = 5.94 GB
After run 3: RSS = 8.09 GB, PyArrow Allocated Bytes = 5.94 GB
After run 4: RSS = 8.10 GB, PyArrow Allocated Bytes = 5.94 GB
In [9]: pa.set_memory_pool(pa.system_memory_pool())
In [10]: main()
system
Runs: 5
After run 0: RSS = 6.81 GB, PyArrow Allocated Bytes = 1.31 GB
After run 1: RSS = 8.10 GB, PyArrow Allocated Bytes = 1.31 GB
After run 2: RSS = 7.88 GB, PyArrow Allocated Bytes = 1.31 GB
After run 3: RSS = 8.09 GB, PyArrow Allocated Bytes = 1.31 GB
After run 4: RSS = 8.09 GB, PyArrow Allocated Bytes = 1.31 GB
{code}
I also checked with the system allocator (since you were using that), and it
doesn't reproduce the issue either. But I do notice a curious difference in the
PyArrow allocated bytes, which might also be related to the difference you see
between the Datasets API and the ParquetFile API.
It seems that with the system memory pool, we don't track the data that is
created by the dataset scanner, only the additional data created in the
{{to_pandas}} call (if I remove {{to_pandas}} from the example, the system pool
reports 0.0 GB allocated). [~westonpace] that seems like a bug?
[~Norbo11] as Weston mentioned, I would indeed try again without the
{{to_pandas}} call (that is the more expensive part here), to see whether that
makes a difference.
> [Python/C++] High memory usage/potential leak when reading parquet using
> Dataset API
> ------------------------------------------------------------------------------------
>
> Key: ARROW-18156
> URL: https://issues.apache.org/jira/browse/ARROW-18156
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet
> Affects Versions: 4.0.1
> Reporter: Norbert
> Priority: Major
>
> Hi,
> I have a 2.35 GB DataFrame (1.17 GB on-disk size) which I'm loading using the
> following snippet:
>
> {code:python}
> import os
>
> import pyarrow
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> from importlib_metadata import version
> from psutil import Process
>
>
> def format_bytes(num_bytes: int):
>     return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"
>
>
> def main():
>     print(version("pyarrow"))
>     print(pyarrow.default_memory_pool().backend_name)
>     process = Process(os.getpid())
>     runs = 10
>     print(f"Runs: {runs}")
>     for i in range(runs):
>         dataset = ds.dataset("df.pq")
>         table = dataset.to_table()
>         df = table.to_pandas()
>         print(
>             f"After run {i}: RSS = {format_bytes(process.memory_info().rss)}, "
>             f"PyArrow Allocated Bytes = {format_bytes(pyarrow.total_allocated_bytes())}"
>         )
> {code}
>
>
> On PyArrow v4.0.1 the output is as follows:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 7.59 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 1: RSS = 13.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 2: RSS = 14.74 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 3: RSS = 15.78 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 4: RSS = 18.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 5: RSS = 19.69 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 6: RSS = 21.21 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 7: RSS = 21.52 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 8: RSS = 21.49 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 9: RSS = 21.72 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 10: RSS = 20.95 GB, PyArrow Allocated Bytes = 6.09 GB{code}
> If I replace {{ds.dataset("df.pq").to_table()}} with
> {{pq.ParquetFile("df.pq").read()}}, the output is:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 2.38 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 1: RSS = 2.49 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 2: RSS = 2.50 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 3: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 4: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 5: RSS = 2.56 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 6: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 7: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 8: RSS = 2.48 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 9: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 10: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB{code}
> The memory usage of the older non-dataset API is much lower - it matches the
> size of the DataFrame much more closely. It also seems that the former example
> has a memory leak? I initially thought the increase in RSS was just due to
> PyArrow's use of jemalloc, but I appear to be using the system allocator here.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)