[
https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899511#comment-16899511
]
Wes McKinney commented on ARROW-6060:
-------------------------------------
I confirmed the peak memory use problem with the following code (thanks for the
help reproducing!):
{code}
import pandas as pd
from pandas.util.testing import rands
import pyarrow as pa
import pyarrow.parquet as pq
import gc
class memory_use:

    def __init__(self):
        self.start_use = pa.total_allocated_bytes()
        self.pool = pa.default_memory_pool()
        self.start_peak_use = self.pool.max_memory()

    def __enter__(self):
        return

    def __exit__(self, type, value, traceback):
        gc.collect()
        print("Change in memory use: {}"
              .format(pa.total_allocated_bytes() - self.start_use))
        print("Change in peak use: {}"
              .format(self.pool.max_memory() - self.start_peak_use))


def generate_strings(length, nunique, string_length=10):
    unique_values = [rands(string_length) for i in range(nunique)]
    values = unique_values * (length // nunique)
    return values

df = pd.DataFrame()
df['a'] = generate_strings(100000000, 10000)
df['b'] = generate_strings(100000000, 10000)
df.to_parquet('/tmp/test.parquet')
with memory_use():
table = pq.read_table('/tmp/test.parquet')
{code}
With 0.13.0 I get:
{code}
Change in memory use: 2825000192
Change in peak use: 3827684608
{code}
and with 0.14.1 and master:
{code}
Change in memory use: 2825000192
Change in peak use: 20585786752
{code}
So peak memory use is now about 20GB where it was less than 4GB before. I'm not
sure which patch caused this, but there have been a _lot_ of patches related to
builders in the last several months, so my guess is that one of the builders has
a bug in its memory allocation logic.
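For anyone wanting to probe this locally, here is a minimal sketch of watching the allocator's high-water mark around a read (assumptions: a small stand-in file at a hypothetical path {{/tmp/small.parquet}}, not the 100M-row reproducer above; {{max_memory()}} is a per-process monotonic high-water mark, so comparing {{use_threads=True}} vs {{False}} needs separate processes):
{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small stand-in file (the reproducer above uses 100M rows)
df = pd.DataFrame({'a': ['x' * 10] * 1000})
df.to_parquet('/tmp/small.parquet')

pool = pa.default_memory_pool()
start_peak = pool.max_memory()

# Single-threaded read; rerun in a fresh process with use_threads=True
# to compare, since max_memory() never decreases within a process
table = pq.read_table('/tmp/small.parquet', use_threads=False)
print("Peak growth:", pool.max_memory() - start_peak)
{code}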
cc [~bkietz] [~pitrou] [~npr]
> [Python] too large memory cost using pyarrow.parquet.read_table with
> use_threads=True
> -------------------------------------------------------------------------------------
>
> Key: ARROW-6060
> URL: https://issues.apache.org/jira/browse/ARROW-6060
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Reporter: Kun Liu
> Priority: Major
>
> I tried to load a parquet file of about 1.8 GB using the following code. It
> crashed due to an out-of-memory error.
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('/tmp/test.parquet'){code}
> However, it worked well with use_threads=False, as follows:
> {code:java}
> pq.read_table('/tmp/test.parquet', use_threads=False){code}
> If pyarrow is downgraded to 0.12.1, there is no such problem.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)