[ https://issues.apache.org/jira/browse/ARROW-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959653#comment-16959653 ]
Joris Van den Bossche commented on ARROW-6985:
----------------------------------------------

[~CHDev93] thanks for the report. There was a performance regression for Parquet files with many columns in 0.15.0 (see ARROW-6876; fixed on master and shortly to be released as 0.15.1). That could explain at least a general slowdown. How much of a slowdown do you see over the course of the loop? I ran your code and possibly see some slowdown (at most 2x), but the timings are a bit noisy.

> [Python] Steadily increasing time to load file using read_parquet
> -----------------------------------------------------------------
>
>                 Key: ARROW-6985
>                 URL: https://issues.apache.org/jira/browse/ARROW-6985
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0, 0.14.0, 0.15.0
>            Reporter: Casey
>            Priority: Minor
>
> I've noticed that reading a Parquet file with pandas' read_parquet function takes steadily longer with each invocation. I've seen the other ticket about memory usage, but I'm seeing no memory impact, just a steadily increasing read time until I restart the Python session.
> Below is some code to reproduce my results. I notice it's particularly bad on wide matrices, especially with pyarrow==0.15.0.
> {code:python}
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pandas as pd
> import os
> import numpy as np
> import time
>
> file = "skinny_matrix.pq"
> if not os.path.isfile(file):
>     mat = np.zeros((6000, 26000))
>     mat.ravel()[::100] = np.random.randn(60 * 26000)
>     df = pd.DataFrame(mat.T)
>     table = pa.Table.from_pandas(df)
>     pq.write_table(table, file)
>
> n_timings = 50
> timings = np.empty(n_timings)
> for i in range(n_timings):
>     start = time.time()
>     new_df = pd.read_parquet(file)
>     end = time.time()
>     timings[i] = end - start
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
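To put a number on "how much does it slow down over the loop", one option (a sketch, not from the original thread; the `timings` values below are illustrative stand-ins for the array the reproduction script fills in) is to compare the mean of the first few and last few iterations:

```python
import numpy as np

# Hypothetical per-call read times in seconds, standing in for the
# `timings` array produced by the reproduction script above.
timings = np.array([0.9, 1.0, 1.1, 1.2, 1.4, 1.5, 1.7, 1.8, 2.0, 2.1])

# Average the first and last few iterations to estimate the slowdown
# factor accumulated over the course of the loop.
first = timings[:3].mean()
last = timings[-3:].mean()
slowdown = last / first

print(f"first iterations: {first:.2f}s, last: {last:.2f}s, slowdown: {slowdown:.1f}x")
```

Averaging a few iterations at each end, rather than comparing single calls, smooths out the noise mentioned in the comment above.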