Ivan SPM created ARROW-4470:
-------------------------------

             Summary: Pyarrow using considerably more memory when reading 
partitioned Parquet file
                 Key: ARROW-4470
                 URL: https://issues.apache.org/jira/browse/ARROW-4470
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.12.0
            Reporter: Ivan SPM


Hi,

I have a partitioned Parquet table in Impala in HDFS, using Hive metastore, 
with the following structure:

{{/data/myparquettable/year=2016}}

{{/data/myparquettable/year=2016/myfile_1.prt}}

{{/data/myparquettable/year=2016/myfile_2.prt}}

{{/data/myparquettable/year=2016/myfile_3.prt}}

{{/data/myparquettable/year=2017}}

{{/data/myparquettable/year=2017/myfile_1.prt}}

{{/data/myparquettable/year=2017/myfile_2.prt}}

{{/data/myparquettable/year=2017/myfile_3.prt}}

and so on. I need to work with one partition, so I copied one partition to a 
local filesystem:

{{hdfs dfs -get /data/myparquettable/year=2017 /local/}}

so now I have some data on the local disk:

{{/local/year=2017/myfile_1.prt}}

{{/local/year=2017/myfile_2.prt}}

etc. I tried to read it using Pyarrow:

{{import pyarrow.parquet as pq}}

{{pq.read_table('/local/year=2017')}}

and it starts reading. The problem is that the local Parquet files total around 
15 GB, and I have exhausted my machine's memory a couple of times: while 
reading these files, Pyarrow uses more than 60 GB of RAM, and I don't know how 
much it would ultimately need because the read never finishes. Is this 
expected? Is there a workaround?
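
In case it helps to narrow this down, here is a minimal sketch of reading the 
partition one row group at a time instead of loading the whole directory at 
once. The helper names ({{list_partition_files}}, {{iter_row_groups}}, 
{{count_rows}}) and the {{.prt}} suffix filter are only illustrative 
assumptions, not part of Pyarrow, and I have not confirmed that this avoids 
the memory growth:

```python
import glob
import os


def list_partition_files(partition_dir, suffix=".prt"):
    """Return the data files of one partition directory, sorted by name.

    The ".prt" suffix matches the file names in this report; it is an
    assumption, adjust it for other layouts.
    """
    pattern = os.path.join(glob.escape(partition_dir), "*" + suffix)
    return sorted(glob.glob(pattern))


def iter_row_groups(path, columns=None):
    """Yield one pyarrow.Table per row group of a single Parquet file."""
    # Imported lazily so the file-listing helper works without pyarrow.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(path)
    for index in range(parquet_file.num_row_groups):
        # Only the requested columns of this one row group are decoded.
        yield parquet_file.read_row_group(index, columns=columns)


def count_rows(partition_dir, columns=None):
    """Walk every row group of every file, keeping only a running total."""
    total = 0
    for path in list_partition_files(partition_dir):
        for table in iter_row_groups(path, columns=columns):
            total += table.num_rows
    return total
```

Calling {{count_rows('/local/year=2017')}} while watching the process memory 
should show whether a single row group is already large once decoded, or 
whether memory only grows when all files are read together.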

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
