Dewey Dunnington created ARROW-17187:
----------------------------------------

             Summary: [R] Improve lazy ALTREP implementation for String
                 Key: ARROW-17187
                 URL: https://issues.apache.org/jira/browse/ARROW-17187
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Dewey Dunnington


ARROW-16578 noted that looping through an ALTREP character vector created by 
the arrow R package is expensive. The temporary workaround is to materialize 
the whole vector the first time any element is requested. This is much faster 
than our initial implementation, but it is probably unnecessary, because other 
ALTREP character implementations do not appear to have this issue:

(The timings below were taken before merging ARROW-16578, which reduces the 
5-second operation to 0.05 seconds.)

{code:R}
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

df1 <- tibble::tibble(x=as.character(floor(runif(1000000) * 20)))
write_parquet(df1,"/tmp/test.parquet")
df2 <- read_parquet("/tmp/test.parquet")
system.time(unique(df1$x))
#>    user  system elapsed 
#>   0.022   0.001   0.023
system.time(unique(df2$x))
#>    user  system elapsed 
#>   4.529   0.680   5.226

# the slowness is almost certainly not due to ALTREP itself
# but is probably something to do with our implementation
tf <- tempfile()
readr::write_csv(df1, tf)
df3 <- vroom::vroom(tf, delim = ",", altrep = TRUE)
#> Rows: 1000000 Columns: 1
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (1): x
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
.Internal(inspect(df3$x))
#> @2d2042048 14 REALSXP g0c0 [REF(65535)] vroom_dbl (len=1000000, materialized=F)
system.time(unique(df3$x))
#>    user  system elapsed 
#>   0.127   0.001   0.128
.Internal(inspect(df3$x))
#> @2d2042048 14 REALSXP g1c0 [MARK,REF(65535)] vroom_dbl (len=1000000, materialized=F)
{code}
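
For reference, the "materialize when the first element is requested" workaround 
described above can be sketched in plain R. This is only an illustration of the 
idea, not the arrow implementation (which goes through the ALTREP API in 
compiled code). `make_lazy_strings()` and its accessors are hypothetical names 
made up for this sketch, and a closure stands in for the ALTREP methods.

{code:R}
# Hypothetical stand-in for a lazily converted string column.
# `values` plays the role of the underlying (not yet converted) data.
make_lazy_strings <- function(values) {
  cache <- NULL          # holds the materialized character vector
  n_materialized <- 0L   # counts how many full conversions happened

  get_elt <- function(i) {
    if (is.null(cache)) {
      # First element request: convert the whole column exactly once
      cache <<- as.character(values)
      n_materialized <<- n_materialized + 1L
    }
    cache[[i]]
  }

  list(
    get_elt = get_elt,
    size = function() length(values),
    is_materialized = function() !is.null(cache),
    n_materialized = function() n_materialized
  )
}

set.seed(1)
x <- make_lazy_strings(floor(runif(1000) * 20))
x$is_materialized()
#> [1] FALSE

# Element-wise loop, similar in spirit to what unique() does internally
vals <- vapply(seq_len(x$size()), x$get_elt, character(1))
x$is_materialized()
#> [1] TRUE
x$n_materialized()
#> [1] 1
{code}

Every access after the first is served from the cached character vector, which 
is why the workaround is fast, at the price of giving up laziness as soon as a 
single element is touched.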




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
