[
https://issues.apache.org/jira/browse/ARROW-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899250#comment-16899250
]
Robin Kåveland commented on ARROW-6060:
---------------------------------------
I had to downgrade pyarrow to 0.13.0 on our VMs today: I was observing parquet
files that we could load just fine with 16GB of RAM fail to load on VMs with
28GB of RAM. Unfortunately, I can't disclose any of the data either. We are
using {{parquet.ParquetDataset.read()}}, but we observe the problem even when
reading single pieces of the parquet datasets (the pieces are between 100MB and
200MB). Most of our columns are unicode and would probably be friendly to
dictionary encoding. The files were written by Spark. Normally these datasets
take a while to load, so memory consumption grows steadily for ~10 seconds, but
now we hit the OOM killer within only a few seconds, so allocation seems very
spiky.
> [Python] too large memory cost using pyarrow.parquet.read_table with
> use_threads=True
> -------------------------------------------------------------------------------------
>
> Key: ARROW-6060
> URL: https://issues.apache.org/jira/browse/ARROW-6060
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Reporter: Kun Liu
> Priority: Major
>
> I tried to load a parquet file of about 1.8GB using the following code. It
> crashed due to an out-of-memory error.
> {code:java}
> import pyarrow.parquet as pq
> pq.read_table('/tmp/test.parquet'){code}
> However, it worked well with use_threads=False, as follows:
> {code:java}
> pq.read_table('/tmp/test.parquet', use_threads=False){code}
> If pyarrow is downgraded to 0.12.1, there is no such problem.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)