[ 
https://issues.apache.org/jira/browse/ARROW-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dewey Dunnington reassigned ARROW-17187:
----------------------------------------

    Assignee: Dewey Dunnington

> [R] Improve lazy ALTREP implementation for String
> -------------------------------------------------
>
>                 Key: ARROW-17187
>                 URL: https://issues.apache.org/jira/browse/ARROW-17187
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Dewey Dunnington
>            Assignee: Dewey Dunnington
>            Priority: Major
>
> ARROW-16578 noted that looping through an ALTREP character vector created 
> by the arrow R package was very slow. The temporary workaround is to 
> materialize the vector whenever the first element is requested. This is much 
> faster than our initial implementation but is probably not necessary, given 
> that other ALTREP character implementations do not appear to have this 
> issue. The timings below were taken before merging ARROW-16578, which 
> reduces the 5 second operation to 0.05 seconds:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
> df1 <- tibble::tibble(x = as.character(floor(runif(1000000) * 20)))
> write_parquet(df1, "/tmp/test.parquet")
> df2 <- read_parquet("/tmp/test.parquet")
> system.time(unique(df1$x))
> #>    user  system elapsed 
> #>   0.022   0.001   0.023
> system.time(unique(df2$x))
> #>    user  system elapsed 
> #>   4.529   0.680   5.226
> # the slowdown is almost certainly not due to ALTREP itself
> # but is probably something to do with our implementation
> tf <- tempfile()
> readr::write_csv(df1, tf)
> df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE)
> #> Rows: 1000000 Columns: 1
> #> ── Column specification ────────────────────────────────────────────────────────
> #> Delimiter: ","
> #> dbl (1): x
> #> 
> #> ℹ Use `spec()` to retrieve the full column specification for this data.
> #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> .Internal(inspect(df3$x))
> #> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=1000000, materialized=F)
> system.time(unique(df3$x))
> #>    user  system elapsed 
> #>   0.127   0.001   0.128
> .Internal(inspect(df3$x))
> #> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=1000000, materialized=F)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
