[ https://issues.apache.org/jira/browse/ARROW-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-17187:
-----------------------------------
Labels: pull-request-available (was: )
> [R] Improve lazy ALTREP implementation for String
> -------------------------------------------------
>
> Key: ARROW-17187
> URL: https://issues.apache.org/jira/browse/ARROW-17187
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Dewey Dunnington
> Assignee: Dewey Dunnington
> Priority: Major
> Labels: pull-request-available
> Fix For: 10.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> ARROW-16578 noted the high cost of looping through an ALTREP character
> vector created by the arrow R package. The temporary workaround is to
> materialize the whole vector the first time an element is requested. This
> is much faster than our initial implementation but is probably unnecessary,
> given that other ALTREP character implementations do not appear to have
> this issue.
>
> The timings below were taken before merging ARROW-16578, which reduces the
> 5 second operation to 0.05 seconds.
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
> df1 <- tibble::tibble(x = as.character(floor(runif(1000000) * 20)))
> write_parquet(df1, "/tmp/test.parquet")
> df2 <- read_parquet("/tmp/test.parquet")
> system.time(unique(df1$x))
> #> user system elapsed
> #> 0.022 0.001 0.023
> system.time(unique(df2$x))
> #> user system elapsed
> #> 4.529 0.680 5.226
> # the speed is almost certainly not due to ALTREP itself
> # but is probably something to do with our implementation
> tf <- tempfile()
> readr::write_csv(df1, tf)
> df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE)
> #> Rows: 1000000 Columns: 1
> #> ── Column specification ────────────────────────────────────────────────────────
> #> Delimiter: ","
> #> dbl (1): x
> #>
> #> ℹ Use `spec()` to retrieve the full column specification for this data.
> #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> .Internal(inspect(df3$x))
> #> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=1000000, materialized=F)
> system.time(unique(df3$x))
> #> user system elapsed
> #> 0.127 0.001 0.128
> .Internal(inspect(df3$x))
> #> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=1000000, materialized=F)
> {code}
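>
> For reference, the "materialize on first access" workaround described above
> can be sketched in plain R. This is only a simulation with a closure, not
> the ALTREP C API and not the arrow package's actual code; make_lazy_chr()
> and its elt / is_materialized fields are hypothetical names chosen for
> illustration.
> {code:R}
> # Plain-R simulation of "materialize on first element access".
> # make_lazy_chr() is a hypothetical helper, not part of arrow or R.
> make_lazy_chr <- function(loader) {
>   cache <- NULL  # stays NULL until the full vector has been materialized
>   list(
>     elt = function(i) {
>       if (is.null(cache)) {
>         cache <<- loader()  # first access converts the whole vector once
>       }
>       cache[[i]]  # later accesses are ordinary vector indexing
>     },
>     is_materialized = function() !is.null(cache)
>   )
> }
>
> lazy <- make_lazy_chr(function() as.character(floor(runif(10) * 20)))
> lazy$is_materialized()
> #> [1] FALSE
> first <- lazy$elt(1L)  # triggers the one-time conversion
> lazy$is_materialized()
> #> [1] TRUE
> {code}
> The trade-off in this sketch is the one noted above: the first element
> access pays for converting every element, in exchange for cheap access
> afterwards.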