[
https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899250#comment-16899250
]
Robin Kåveland commented on ARROW-6060:
---------------------------------------
I had to downgrade pyarrow to 0.13.0 on our VMs today: I was observing parquet
files that we could load just fine with 16GB of RAM fail to load on VMs with
28GB of RAM. Unfortunately, I can't disclose any of the data either. We are
using {{parquet.ParquetDataset.read()}}, but we observe the problem even when
reading single pieces of the parquet datasets (the pieces are between 100MB and
200MB). Most of our columns are unicode and would probably be friendly to
dictionary encoding. The files were written by Spark. Normally these datasets
take a while to load, so memory consumption grows steadily for ~10 seconds, but
now we hit the OOM killer within only a few seconds, so allocation seems very
spiky.
> [Python] too large memory cost using pyarrow.parquet.read_table with
> use_threads=True
> -------------------------------------------------------------------------------------
>
> Key: ARROW-6060
> URL: https://issues.apache.org/jira/browse/ARROW-6060
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Reporter: Kun Liu
> Priority: Major
>
> I tried to load a parquet file of about 1.8GB using the following code. It
> crashed due to an out-of-memory error.
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('/tmp/test.parquet'){code}
> However, it worked well with use_threads=False, as follows:
> {code:java}
> pq.read_table('/tmp/test.parquet', use_threads=False){code}
> If pyarrow is downgraded to 0.12.1, there is no such problem.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)