[
https://issues.apache.org/jira/browse/ARROW-18156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624593#comment-17624593
]
Weston Pace edited comment on ARROW-18156 at 10/26/22 4:38 PM:
---------------------------------------------------------------
{quote}
What else could affect behavior here? Python version? How the pyarrow package
was installed? I'm using Python 3.8.3.
{quote}
Given this is the system allocator, the OS and glibc version might be more
significant than the Python version. It could also be a pip vs. conda
difference; I will try testing with pip. The dataset version of scanning
generates a lot of temporary allocations (probably too many, but that is
another story). It seems the allocator is not releasing the RAM.
For reference, here are the results I get. I'm using Python 3.9.7 with Arrow
4.0.0 from conda-forge.
{noformat}
mode=dataset allocator=jemalloc
jemalloc
Runs: 10
After run 0: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 1: RSS = 9.46 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 2: RSS = 9.47 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 3: RSS = 9.46 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 4: RSS = 9.46 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 5: RSS = 9.47 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 6: RSS = 9.46 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 7: RSS = 9.46 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 8: RSS = 9.46 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 9: RSS = 9.46 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
{noformat}
{noformat}
mode=ParquetFile.read() allocator=jemalloc
jemalloc
Runs: 10
After run 0: RSS = 6.85 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 1: RSS = 8.15 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 2: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 3: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 4: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 5: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 6: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 7: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 8: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 9: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
{noformat}
{noformat}
mode=dataset allocator=system
system
Runs: 10
After run 0: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 1: RSS = 9.47 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 2: RSS = 9.47 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 3: RSS = 9.47 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 4: RSS = 9.47 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 5: RSS = 9.47 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 6: RSS = 9.48 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 7: RSS = 9.48 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 8: RSS = 9.48 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 9: RSS = 9.48 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
{noformat}
{noformat}
mode=ParquetFile.read() allocator=system
system
Runs: 10
After run 0: RSS = 6.84 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 1: RSS = 8.15 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 2: RSS = 8.15 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 3: RSS = 8.15 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 4: RSS = 8.15 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 5: RSS = 8.15 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 6: RSS = 8.15 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 7: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 8: RSS = 8.15 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
After run 9: RSS = 8.16 GB, PyArrow Allocated Bytes = 5.94 GB Table Nbytes = 4.63 GB
{noformat}
Note: I am working around the tracking bug [~jorisvandenbossche] mentioned by
setting the default pool via an environment variable.
> [Python/C++] High memory usage/potential leak when reading parquet using
> Dataset API
> ------------------------------------------------------------------------------------
>
> Key: ARROW-18156
> URL: https://issues.apache.org/jira/browse/ARROW-18156
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet
> Affects Versions: 4.0.1
> Reporter: Norbert
> Priority: Major
>
> Hi,
> I have a 2.35 GB DataFrame (1.17 GB on-disk size) which I'm loading using the
> following snippet:
>
> {code:python}
> import os
> import pyarrow
> import pyarrow.dataset as ds
> from importlib_metadata import version
> from psutil import Process
> import pyarrow.parquet as pq
>
> def format_bytes(num_bytes: int):
>     return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"
>
> def main():
>     print(version("pyarrow"))
>     print(pyarrow.default_memory_pool().backend_name)
>     process = Process(os.getpid())
>     runs = 10
>     print(f"Runs: {runs}")
>     for i in range(runs):
>         dataset = ds.dataset("df.pq")
>         table = dataset.to_table()
>         df = table.to_pandas()
>         print(f"After run {i}: RSS = {format_bytes(process.memory_info().rss)}, "
>               f"PyArrow Allocated Bytes = {format_bytes(pyarrow.total_allocated_bytes())}")
>
> main()
> {code}
>
>
> On PyArrow v4.0.1 the output is as follows:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 7.59 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 1: RSS = 13.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 2: RSS = 14.74 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 3: RSS = 15.78 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 4: RSS = 18.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 5: RSS = 19.69 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 6: RSS = 21.21 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 7: RSS = 21.52 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 8: RSS = 21.49 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 9: RSS = 21.72 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 10: RSS = 20.95 GB, PyArrow Allocated Bytes = 6.09 GB{code}
> If I replace ds.dataset("df.pq").to_table() with
> pq.ParquetFile("df.pq").read(), the output is:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 2.38 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 1: RSS = 2.49 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 2: RSS = 2.50 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 3: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 4: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 5: RSS = 2.56 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 6: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 7: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 8: RSS = 2.48 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 9: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 10: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB{code}
> The memory usage of the older non-dataset API is much lower - it tracks the
> size of the dataframe much more closely. It also seems like the former
> example leaks memory? I thought the increase in RSS was just due to
> PyArrow's use of jemalloc, but I seem to be using the system allocator here.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)