Francisco Sanchez created ARROW-6059:
----------------------------------------

Summary: Regression memory issue when calling pandas.read_parquet
Key: ARROW-6059
URL: https://issues.apache.org/jira/browse/ARROW-6059
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.14.1, 0.14.0
Reporter: Francisco Sanchez

I have a ~2MB parquet file with the following schema:
{code:java}
bag_stamp: timestamp[ns]
transforms_[]_.header.seq: list<item: int64>
  child 0, item: int64
transforms_[]_.header.stamp: list<item: timestamp[ns]>
  child 0, item: timestamp[ns]
transforms_[]_.header.frame_id: list<item: string>
  child 0, item: string
transforms_[]_.child_frame_id: list<item: string>
  child 0, item: string
transforms_[]_.transform.translation.x: list<item: double>
  child 0, item: double
transforms_[]_.transform.translation.y: list<item: double>
  child 0, item: double
transforms_[]_.transform.translation.z: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.x: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.y: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.z: list<item: double>
  child 0, item: double
transforms_[]_.transform.rotation.w: list<item: double>
  child 0, item: double
{code}
If I read it with *pandas.read_parquet()* using pyarrow 0.13.0, everything works and the file loads almost instantly. If I try the same with 0.14.0 or 0.14.1, it takes a long time and uses ~10GB of RAM. Often, if I don't have enough memory available, the process is simply killed for running out of memory (OOM).

Now, if I use the following code snippet instead, it works perfectly with all of these versions:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# input_file (path to the parquet file) and columns (list of column names)
# are defined elsewhere.
parquet_file = pq.ParquetFile(input_file)
tables = []
# Read one row group at a time, then concatenate the resulting tables.
for row_group in range(parquet_file.num_row_groups):
    tables.append(parquet_file.read_row_group(row_group, columns=columns, use_pandas_metadata=True))
df = pa.concat_tables(tables).to_pandas()
{code}

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)