[ 
https://issues.apache.org/jira/browse/ARROW-6059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896224#comment-16896224
 ] 

Francisco Sanchez commented on ARROW-6059:
------------------------------------------

[~ggGibs] I am not sure it is related to the one you mentioned; initially I 
also thought that, so I tried passing that argument set to False from the 
pandas call. It took more time but ended up using the same amount of memory. 
I would think it is more related to ARROW-5965, but I don't have any evidence.

> [Python] Regression memory issue when calling pandas.read_parquet
> -----------------------------------------------------------------
>
>                 Key: ARROW-6059
>                 URL: https://issues.apache.org/jira/browse/ARROW-6059
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.0, 0.14.1
>            Reporter: Francisco Sanchez
>            Priority: Major
>
> I have a ~3MB parquet file with the following schema:
> {code:java}
> bag_stamp: timestamp[ns]
> transforms_[]_.header.seq: list<item: int64>
>   child 0, item: int64
> transforms_[]_.header.stamp: list<item: timestamp[ns]>
>   child 0, item: timestamp[ns]
> transforms_[]_.header.frame_id: list<item: string>
>   child 0, item: string
> transforms_[]_.child_frame_id: list<item: string>
>   child 0, item: string
> transforms_[]_.transform.translation.x: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.translation.y: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.translation.z: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.x: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.y: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.z: list<item: double>
>   child 0, item: double
> transforms_[]_.transform.rotation.w: list<item: double>
>   child 0, item: double
> {code}
>  If I read it with *pandas.read_parquet()* using pyarrow 0.13.0 all seems 
> fine and it loads in no time. If I try the same with 0.14.0 or 0.14.1, it 
> takes a long time and uses ~10GB of RAM. Often, if I don't have enough 
> available memory, the process is simply killed (OOM). However, if I use the 
> following code snippet instead, it works perfectly with all versions:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> parquet_file = pq.ParquetFile(input_file)
> tables = []
> for row_group in range(parquet_file.num_row_groups):
>     tables.append(parquet_file.read_row_group(row_group, columns=columns,
>                                               use_pandas_metadata=True))
> df = pa.concat_tables(tables).to_pandas()
> {code}
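
For anyone trying to reproduce the comparison, here is a minimal self-contained sketch of the two read paths described above: the direct pandas.read_parquet() call (the one reported as regressed on 0.14.x) versus the per-row-group workaround. The file name, column names, and data are made up for illustration; only the API calls come from the issue.

{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a tiny table with a list<double> column, similar in shape to the
# transforms_[]_.* columns in the reported schema, and write it with
# multiple row groups.
table = pa.table({
    "bag_stamp": pa.array([1, 2, 3], type=pa.timestamp("ns")),
    "translation_x": pa.array([[0.1, 0.2], [0.3], []],
                              type=pa.list_(pa.float64())),
})
pq.write_table(table, "example.parquet", row_group_size=2)

# Direct path: the call reported to be slow / memory-heavy on 0.14.x.
df_direct = pd.read_parquet("example.parquet")

# Workaround path: read each row group separately, then concatenate.
parquet_file = pq.ParquetFile("example.parquet")
tables = [parquet_file.read_row_group(i, use_pandas_metadata=True)
          for i in range(parquet_file.num_row_groups)]
df_workaround = pa.concat_tables(tables).to_pandas()

# Both paths should produce the same data.
assert df_direct["bag_stamp"].equals(df_workaround["bag_stamp"])
{code}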



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)