[jira] [Updated] (ARROW-4470) [Python] Pyarrow using considerable more memory when reading partitioned Parquet file

Francois Saint-Jacques (Jira) Wed, 21 Aug 2019 09:33:19 -0700


     [ https://issues.apache.org/jira/browse/ARROW-4470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Francois Saint-Jacques updated ARROW-4470:
------------------------------------------
    Labels: dataset datasets parquet  (was: datasets parquet)

> [Python] Pyarrow using considerable more memory when reading partitioned 
> Parquet file
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-4470
>                 URL: https://issues.apache.org/jira/browse/ARROW-4470
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.0
>            Reporter: Ivan SPM
>            Priority: Major
>              Labels: dataset, datasets, parquet
>             Fix For: 1.0.0
>
>
> Hi,
> I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, 
> with the following structure:
> {{/data/myparquettable/year=2016}}
> {{/data/myparquettable/year=2016/myfile_1.prt}}
> {{/data/myparquettable/year=2016/myfile_2.prt}}
> {{/data/myparquettable/year=2016/myfile_3.prt}}
> {{/data/myparquettable/year=2017}}
> {{/data/myparquettable/year=2017/myfile_1.prt}}
> {{/data/myparquettable/year=2017/myfile_2.prt}}
> {{/data/myparquettable/year=2017/myfile_3.prt}}
> and so on. I need to work with one partition, so I copied one partition to a 
> local filesystem:
> {{hdfs dfs -get /data/myparquettable/year=2017 /local/}}
> so now I have some data on the local disk:
> {{/local/year=2017/myfile_1.prt}}
> {{/local/year=2017/myfile_2.prt}}
> etc. I tried to read it using Pyarrow:
> {{import pyarrow.parquet as pq}}
> {{pq.read_table('/local/year=2017')}}
> and it starts reading. The problem is that the local Parquet files are around 
> 15GB total, and I blew up my machine memory a couple of times because when 
> reading these files, Pyarrow is using more than 60GB of RAM, and I'm not sure 
> how much it will take because it never finishes. Is this expected? Is there a 
> workaround?
>  
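
One possible workaround, sketched here under assumptions rather than taken from the issue: instead of loading the whole partition directory in a single call, walk the files one by one and read each file one row group at a time, so that only a single decoded row group needs to be held in memory at once. The helper names (partition_files, iter_row_groups, count_rows) are hypothetical; the .prt suffix and the /local/year=2017 path come from the report. Whether this keeps memory low enough for the reporter's data has not been verified.

```python
import os


def partition_files(partition_dir, suffix=".prt"):
    """Return the data files of one partition directory, sorted by name.

    Files without the given suffix (for example marker files) are skipped.
    """
    return sorted(
        os.path.join(partition_dir, name)
        for name in os.listdir(partition_dir)
        if name.endswith(suffix)
    )


def iter_row_groups(path, columns=None):
    """Yield the row groups of one Parquet file as pyarrow.Table objects.

    pyarrow is imported lazily so that the directory helper above can be
    used without it. Passing a list of column names reads only those columns.
    """
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    for index in range(parquet_file.num_row_groups):
        yield parquet_file.read_row_group(index, columns=columns)


def count_rows(partition_dir, columns=None):
    """Example consumer: count rows without keeping any table around."""
    total = 0
    for path in partition_files(partition_dir):
        for table in iter_row_groups(path, columns=columns):
            total += table.num_rows  # replace with real per-chunk processing
    return total
```

Calling count_rows('/local/year=2017') would then process the partition chunk by chunk. Restricting the columns argument to the columns actually needed is a second, independent way to reduce the amount of data decoded.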



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
