[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959653#comment-16959653 ]
Joris Van den Bossche commented on ARROW-6985:
----------------------------------------------

[~CHDev93] thanks for the report. There was a performance regression for Parquet files with many columns in 0.15.0 (see ARROW-6876; fixed on master and shortly to be released as 0.15.1). That could explain at least a general slowdown. How much of a slowdown do you see over the course of the loop? I ran your code and possibly see some slowdown (at most 2x), but the timings are a bit noisy.

> [Python] Steadily increasing time to load file using read_parquet
> -----------------------------------------------------------------
>
>                 Key: ARROW-6985
>                 URL: https://issues.apache.org/jira/browse/ARROW-6985
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0, 0.14.0, 0.15.0
>            Reporter: Casey
>            Priority: Minor
>
> I've noticed that reading a Parquet file with pandas' read_parquet function takes steadily longer with each invocation. I've seen the other ticket about memory usage, but I'm seeing no memory impact, just a steadily increasing read time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on wide matrices, especially with pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
>
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
>
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
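To put a number on "how much does it slow down over the loop", one option (a sketch, not from the original thread; the `timings` values below are illustrative stand-ins for the array the reproduction script fills in) is to compare the mean of the first few and last few iterations:

```python
import numpy as np

# Hypothetical per-call read times in seconds, standing in for the
# `timings` array produced by the reproduction script above.
timings = np.array([0.9, 1.0, 1.1, 1.2, 1.4, 1.5, 1.7, 1.8, 2.0, 2.1])

# Average the first and last few iterations to estimate the slowdown
# factor accumulated over the course of the loop.
first = timings[:3].mean()
last = timings[-3:].mean()
slowdown = last / first

print(f"first iterations: {first:.2f}s, last: {last:.2f}s, slowdown: {slowdown:.1f}x")
```

Averaging a few iterations at each end, rather than comparing single calls, smooths out the noise mentioned in the comment above.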