[jira] [Commented] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory

Wes McKinney (JIRA) Fri, 12 May 2017 10:57:16 -0700

    [ https://issues.apache.org/jira/browse/ARROW-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008481#comment-16008481 ]


Wes McKinney commented on ARROW-1017:
-------------------------------------

Thanks [~jporritt], I will take a look and see if I can reproduce the issue. 

> Python: Calling to_pandas on a Parquet file in HDFS leaks memory
> ----------------------------------------------------------------
>
>                 Key: ARROW-1017
>                 URL: https://issues.apache.org/jira/browse/ARROW-1017
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.3.0
>            Reporter: James Porritt
>             Fix For: 0.4.0
>
>
> Running the following code results in ever-increasing memory usage, even 
> though I would expect the dataframe to be garbage collected when it goes out 
> of scope. For the size of my Parquet file, I see usage increase by about 
> 3 GB per loop iteration:
> {code}
> from pyarrow import HdfsClient
> def read_parquet_file(client, parquet_file):
>     parquet = client.read_parquet(parquet_file)
>     df = parquet.to_pandas()
> client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
> parquet_file = '/my/parquet/file'
> while True:
>     read_parquet_file(client, parquet_file)
> {code}
> Is there a reference count issue similar to ARROW-362?
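
To put a number on the growth described in the report, a loop like the one above can be wrapped in a small measurement helper. The following is only a minimal sketch and not part of the original report: `peak_rss` and `memory_growth` are hypothetical names, and the default probe assumes a Unix system where the standard `resource` module is available (it reports peak resident set size, in kilobytes on Linux and bytes on macOS).

```python
import gc
import resource


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def memory_growth(fn, iterations=5, probe=peak_rss):
    """Call fn() repeatedly and return the change in probe() across each call.

    probe is any zero-argument callable returning a number; the default is
    an assumption that suits Unix systems only.
    """
    deltas = []
    for _ in range(iterations):
        before = probe()
        fn()
        gc.collect()  # rule out garbage that is merely waiting for collection
        deltas.append(probe() - before)
    return deltas
```

With the names from the report, a call such as `memory_growth(lambda: read_parquet_file(client, parquet_file))` would return one delta per iteration. Deltas that stay large after the first call would be consistent with memory not being released, though peak RSS alone cannot say which layer is holding it.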



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
