[
https://issues.apache.org/jira/browse/ARROW-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165797#comment-17165797
]
Neal Richardson commented on ARROW-9557:
----------------------------------------
Hmm, I wouldn't expect that big of a difference. Reading through the source,
the Python and R bindings look like they're doing about the same thing, but
there must be something they're doing differently (assuming that
{{"test_tab.pq"}} in your faster Python example is a typo and not secretly a
much smaller file ;)
Would you mind profiling the {{read_by_column}} function to see what it's doing
in that 100 seconds? My guess is that there's something it's doing inside the
loop that seems cheap enough on its own but adds up when you call it 4000 times.
> [R] Iterating over parquet columns is slow in R
> -----------------------------------------------
>
> Key: ARROW-9557
> URL: https://issues.apache.org/jira/browse/ARROW-9557
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Affects Versions: 1.0.0
> Reporter: Karl Dunkle Werner
> Priority: Minor
> Labels: performance
>
> I've found that reading in a parquet file one column at a time is slow in R –
> much slower than reading the whole file at once in R, or reading one column at
> a time in Python.
> An example is below, though it's certainly possible I've done my benchmarking
> incorrectly.
>
> Python setup and benchmarking:
> {code:python}
> import numpy as np
> import pyarrow
> import pyarrow.parquet as pq
> from numpy.random import default_rng
> from time import time
> # Create a large, random array to save. ~1.5 GB.
> rng = default_rng(seed = 1)
> n_col = 4000
> n_row = 50000
> mat = rng.standard_normal((n_col, n_row))
> col_names = [str(nm) for nm in range(n_col)]
> tab = pyarrow.Table.from_arrays(mat, names=col_names)
> pq.write_table(tab, "test_tab.parquet", use_dictionary=False)
> # How long does it take to read the whole thing in python?
> time_start = time()
> _ = pq.read_table("test_tab.parquet")
> elapsed = time() - time_start
> print(elapsed) # under 1 second on my computer
> time_start = time()
> f = pq.ParquetFile("test_tab.pq")
> for one_col in col_names:
>     _ = f.read(one_col).column(0)
> elapsed = time() - time_start
> print(elapsed) # about 2 seconds
> {code}
> R benchmarking, using the same {{test_tab.parquet}} file
> {code:r}
> library(arrow)
> read_by_column <- function(f) {
>   table <- ParquetFileReader$create(f)
>   cols <- as.character(0:3999)
>   purrr::walk(cols, ~ table$ReadTable(.)$column(0))
> }
> bench::mark(
>   read_parquet("test_tab.parquet", as_data_frame = FALSE), # 0.6 s
>   read_parquet("test_tab.parquet", as_data_frame = TRUE),  # 1 s
>   read_by_column("test_tab.parquet"),                      # 100 s
>   check = FALSE
> )
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)