[ https://issues.apache.org/jira/browse/ARROW-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166497#comment-17166497 ]

Karl Dunkle Werner commented on ARROW-9557:
-------------------------------------------

It looks like {{shared_ptr_is_null}} is taking half the time, and 
{{vars_select_eval}} is taking another third.

I attached a screenshot of benchmark results from running the code below. (No 
difference from the original code I posted, except {{read_one}} is broken out 
into its own function.)
{code:r}
library(arrow)

# Read a single column as a one-column Table, then extract the column.
read_one <- function(col, table) {
  x <- table$ReadTable(col)
  x$column(0)
}

read_by_column <- function(f) {
  table <- ParquetFileReader$create(f)
  cols <- as.character(0:3999)
  purrr::walk(cols, read_one, table = table)
}

read_by_column("test_tab.parquet")
{code}
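For reference, a profile like the attached screenshot can be captured with the {{profvis}} package (a sketch, assuming {{profvis}} is installed and {{test_tab.parquet}} exists in the working directory):
{code:r}
library(arrow)
library(profvis)

# Same benchmark as above, wrapped in profvis() to record
# where time is spent (flame graph in the RStudio viewer).
read_one <- function(col, table) {
  x <- table$ReadTable(col)
  x$column(0)
}

read_by_column <- function(f) {
  table <- ParquetFileReader$create(f)
  purrr::walk(as.character(0:3999), read_one, table = table)
}

profvis(read_by_column("test_tab.parquet"))
{code}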

> [R] Iterating over parquet columns is slow in R
> -----------------------------------------------
>
>                 Key: ARROW-9557
>                 URL: https://issues.apache.org/jira/browse/ARROW-9557
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 1.0.0
>            Reporter: Karl Dunkle Werner
>            Priority: Minor
>              Labels: performance
>         Attachments: profile_screenshot.png
>
>
> I've found that reading a parquet file one column at a time is slow in R, 
> much slower than reading the whole file at once in R, or reading one column 
> at a time in Python.
> An example is below, though it's certainly possible I've done my benchmarking 
> incorrectly.
>  
> Python setup and benchmarking:
> {code:python}
> import pyarrow
> import pyarrow.parquet as pq
> from numpy.random import default_rng
> from time import time
>
> # Create a large, random array to save. ~1.5 GB.
> rng = default_rng(seed=1)
> n_col = 4000
> n_row = 50000
> mat = rng.standard_normal((n_col, n_row))
> col_names = [str(nm) for nm in range(n_col)]
> tab = pyarrow.Table.from_arrays(mat, names=col_names)
> pq.write_table(tab, "test_tab.parquet", use_dictionary=False)
>
> # How long does it take to read the whole thing in Python?
> time_start = time()
> _ = pq.read_table("test_tab.parquet")  # edit: corrected filename
> elapsed = time() - time_start
> print(elapsed)  # under 1 second on my computer
>
> # How long does it take to read one column at a time?
> time_start = time()
> f = pq.ParquetFile("test_tab.parquet")
> for one_col in col_names:
>     _ = f.read(one_col).column(0)
> elapsed = time() - time_start
> print(elapsed)  # about 2 seconds
> {code}
> R benchmarking, using the same {{test_tab.parquet}} file:
> {code:r}
> library(arrow)
> read_by_column <- function(f) {
>     table <- ParquetFileReader$create(f)
>     cols <- as.character(0:3999)
>     purrr::walk(cols, ~table$ReadTable(.)$column(0))
> }
> bench::mark(
>     read_parquet("test_tab.parquet", as_data_frame=FALSE), #   0.6 s
>     read_parquet("test_tab.parquet", as_data_frame=TRUE),  #   1 s
>     read_by_column("test_tab.parquet"),                    # 100 s
>     check=FALSE
> )
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
