[GitHub] [arrow] romainfrancois commented on pull request #11225: ARROW-13112: [R] altrep vectors for strings and other types

GitBox Tue, 28 Sep 2021 01:40:41 -0700


romainfrancois commented on pull request #11225:
URL: https://github.com/apache/arrow/pull/11225#issuecomment-928981161



   @jonkeane Getting this on master: 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   #> See arrow_info() for available features
   library(dplyr, warn.conflicts = FALSE)
   
   inject_na <- function(x) {
     x[sample(seq_along(x), length(x) / 10)] <- NA
     x
   }
   
   df <- data.frame(
     x = runif(1e6), 
     y = sample(1:100, 1e6, replace = TRUE), 
     z = sample(month.abb, 1e6, replace = TRUE)
   ) %>% 
     mutate(
       across(everything(), inject_na, .names = "{.col}_na")
     )
   
   a <- Array$create(df)
   a$type
   #> StructType
   #> struct<x: double, y: int32, z: string, x_na: double, y_na: int32, z_na: string>
   
   bench::workout({
     v <- a$as_vector()
     v$x
     v$y
     v$z
     v$x_na
     v$y_na
     v$z_na
   })
   #> # A tibble: 7 × 3
   #>   exprs               process     real
   #>   <bch:expr>         <bch:tm> <bch:tm>
   #> 1 v <- a$as_vector()    148ms 148.53ms
   #> 2 v$x                    49µs  50.43µs
   #> 3 v$y                     3µs    4.9µs
   #> 4 v$z                     4µs   4.24µs
   #> 5 v$x_na                  4µs   3.89µs
   #> 6 v$y_na                  3µs   3.63µs
   #> 7 v$z_na                  3µs   3.97µs
   
   # get a few values
   bench::workout({
     v$x[1:10]
     v$y[1:10]
     v$z[1:10]
     v$x_na[1:10]
     v$y_na[1:10]
     v$z_na[1:10]
   })
   #> # A tibble: 6 × 3
   #>   exprs         process     real
   #>   <bch:expr>   <bch:tm> <bch:tm>
   #> 1 v$x[1:10]        11µs  12.71µs
   #> 2 v$y[1:10]         5µs   5.96µs
   #> 3 v$z[1:10]         4µs   5.16µs
   #> 4 v$x_na[1:10]      4µs   4.79µs
   #> 5 v$y_na[1:10]      4µs   4.48µs
   #> 6 v$z_na[1:10]      4µs   4.82µs
   
   # duplicate -> materialize
   bench::workout({
     v$x[]
     v$y[]
     v$z[]
     v$x_na[]
     v$y_na[]
     v$z_na[]
   })
   #> # A tibble: 6 × 3
   #>   exprs       process     real
   #>   <bch:expr> <bch:tm> <bch:tm>
   #> 1 v$x[]        3.71ms   3.71ms
   #> 2 v$y[]        8.95ms   8.95ms
   #> 3 v$z[]        9.43ms   9.43ms
   #> 4 v$x_na[]    67.23ms  67.32ms
   #> 5 v$y_na[]      505µs 508.25µs
   #> 6 v$z_na[]     4.55ms   4.56ms
   ```
   
   <sup>Created on 2021-09-28 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
   
   vs. this on this pull request: 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   #> See arrow_info() for available features
   library(dplyr, warn.conflicts = FALSE)
   
   inject_na <- function(x) {
     x[sample(seq_along(x), length(x) / 10)] <- NA
     x
   }
   
   df <- data.frame(
     x = runif(1e6), 
     y = sample(1:100, 1e6, replace = TRUE), 
     z = sample(month.abb, 1e6, replace = TRUE)
   ) %>% 
     mutate(
       across(everything(), inject_na, .names = "{.col}_na")
     )
   
   a <- Array$create(df)
   a$type
   #> StructType
   #> struct<x: double, y: int32, z: string, x_na: double, y_na: int32, z_na: string>
   
   bench::workout({
     v <- a$as_vector()
     v$x
     v$y
     v$z
     v$x_na
     v$y_na
     v$z_na
   })
   #> # A tibble: 7 × 3
   #>   exprs               process     real
   #>   <bch:expr>         <bch:tm> <bch:tm>
   #> 1 v <- a$as_vector()    236µs 237.54µs
   #> 2 v$x                    43µs  45.07µs
   #> 3 v$y                     3µs   4.86µs
   #> 4 v$z                     3µs   3.95µs
   #> 5 v$x_na                  3µs   3.81µs
   #> 6 v$y_na                  4µs   3.71µs
   #> 7 v$z_na                  3µs   3.51µs
   
   # get a few values
   bench::workout({
     v$x[1:10]
     v$y[1:10]
     v$z[1:10]
     v$x_na[1:10]
     v$y_na[1:10]
     v$z_na[1:10]
   })
   #> # A tibble: 6 × 3
   #>   exprs         process     real
   #>   <bch:expr>   <bch:tm> <bch:tm>
   #> 1 v$x[1:10]        14µs  16.55µs
   #> 2 v$y[1:10]        13µs  13.92µs
   #> 3 v$z[1:10]        40µs  41.81µs
   #> 4 v$x_na[1:10]      6µs   7.38µs
   #> 5 v$y_na[1:10]      6µs   6.56µs
   #> 6 v$z_na[1:10]     30µs  30.25µs
   
   # duplicate -> materialize
   bench::workout({
     v$x[]
     v$y[]
     v$z[]
     v$x_na[]
     v$y_na[]
     v$z_na[]
   })
   #> # A tibble: 6 × 3
   #>   exprs       process     real
   #>   <bch:expr> <bch:tm> <bch:tm>
   #> 1 v$x[]        4.17ms   4.17ms
   #> 2 v$y[]        2.08ms   2.08ms
   #> 3 v$z[]      111.07ms 111.13ms
   #> 4 v$x_na[]     6.39ms   6.39ms
   #> 5 v$y_na[]     4.12ms   4.12ms
   #> 6 v$z_na[]    31.86ms  31.89ms
   ```
   
   <sup>Created on 2021-09-28 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
   
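   I guess it's a tradeoff: this PR is lazier, so `as_vector()` is nearly free (about 0.24ms vs. about 148ms on master) and the cost only shows up when a column is actually materialized (e.g. `v$z[]` at about 111ms). The previous code was greedy, materializing copies immediately, but it also did that work in parallel.
   
   I think it's fine to only pay for making standard R vectors when it becomes necessary.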
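   
   To make the tradeoff a bit more concrete, here is a rough back-of-the-envelope sketch that only adds up timings copied from the two reprex outputs above. The variable names are made up for illustration, and since these are single `bench::workout()` runs, the totals are indicative at best:
   
   ``` r
   # Timings in milliseconds, copied from the two reprex outputs above
   # (single bench::workout() runs, so treat them as rough).
   master_convert <- 148.53          # master:  v <- a$as_vector()
   pr_convert     <- 237.54 / 1000   # this PR: v <- a$as_vector(), µs -> ms
   
   # This PR: cost of v$<col>[] per column, i.e. forcing materialization.
   pr_materialize <- c(
     x = 4.17, y = 2.08, z = 111.13,
     x_na = 6.39, y_na = 4.12, z_na = 31.89
   )
   
   # Worst case for the PR: every column ends up materialized.
   pr_all <- pr_convert + sum(pr_materialize)
   
   # If only the numeric columns are touched, the strings are never paid for.
   numeric_cols <- c("x", "y", "x_na", "y_na")
   pr_numeric_only <- pr_convert + sum(pr_materialize[numeric_cols])
   
   round(c(master = master_convert, pr_all = pr_all, pr_numeric_only = pr_numeric_only), 2)
   ```
   
   Read this way, the lazy approach only comes out slower than master's upfront conversion when every column, strings included, ends up materialized. Most of that cost is in the two string columns.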
   
   


