[ https://issues.apache.org/jira/browse/ARROW-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166611#comment-17166611 ]

Neal Richardson commented on ARROW-9557:
----------------------------------------

Interesting, thanks for sharing. Will definitely take a look.

FYI [~romainfrancois]

> [R] Iterating over parquet columns is slow in R
> -----------------------------------------------
>
>                 Key: ARROW-9557
>                 URL: https://issues.apache.org/jira/browse/ARROW-9557
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 1.0.0
>            Reporter: Karl Dunkle Werner
>            Priority: Minor
>              Labels: performance
>         Attachments: profile_screenshot.png
>
>
> I've found that reading in a parquet file one column at a time is slow in R – 
> much slower than reading the whole file all at once in R, or reading one 
> column at a time in Python.
> An example is below, though it's certainly possible I've done my benchmarking 
> incorrectly.
>  
> Python setup and benchmarking:
> {code:python}
> import numpy as np
> import pyarrow
> import pyarrow.parquet as pq
> from numpy.random import default_rng
> from time import time
> # Create a large, random array to save. ~1.5 GB.
> rng = default_rng(seed = 1)
> n_col = 4000
> n_row = 50000
> mat = rng.standard_normal((n_col, n_row))
> col_names = [str(nm) for nm in range(n_col)]
> tab = pyarrow.Table.from_arrays(mat, names=col_names)
> pq.write_table(tab, "test_tab.parquet", use_dictionary=False)
> # How long does it take to read the whole thing in python?
> time_start = time()
> _ = pq.read_table("test_tab.parquet") # edit: corrected filename
> elapsed = time() - time_start
> print(elapsed) # under 1 second on my computer
> time_start = time()
> f = pq.ParquetFile("test_tab.parquet")
> for one_col in col_names:
>     _ = f.read(one_col).column(0)
> elapsed = time() - time_start
> print(elapsed) # about 2 seconds
> {code}
> R benchmarking, using the same {{test_tab.parquet}} file
> {code:r}
> library(arrow)
> read_by_column <- function(f) {
>     reader <- ParquetFileReader$create(f)
>     cols <- as.character(0:3999)
>     purrr::walk(cols, ~reader$ReadTable(.)$column(0))
> }
> bench::mark(
>     read_parquet("test_tab.parquet", as_data_frame=FALSE), #   0.6 s
>     read_parquet("test_tab.parquet", as_data_frame=TRUE),  #   1 s
>     read_by_column("test_tab.parquet"),                    # 100 s
>     check=FALSE
> )
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
