paleolimbot commented on issue #39004:
URL: https://github.com/apache/arrow/issues/39004#issuecomment-1834306130

   Thanks for opening the issue, and thanks for the reprex!
   
   It is true that ALTREP objects generally perform more slowly than non-ALTREP 
objects, although I wouldn't have expected this particular operation to be that 
much slower.
   
   I will dig into this, but in the meantime, you can turn ALTREP off using 
`options(arrow.use_altrep = FALSE)`:
   
   ``` r
   library(arrow)
   library(dplyr)
   options(arrow.use_altrep = FALSE)
   
   # generate data
   x = runif(29500000) * 10
   d = data.frame(cv = x)
   write_dataset(d, "/tmp/data.arrow")
   # then read back
   df = open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
   x = df$cv
   y = x + 0
   
   identical(x, y)
   #> [1] TRUE
   microbenchmark::microbenchmark(x={sum(is.na(x))}, y={sum(is.na(y))})
   #> Warning in microbenchmark::microbenchmark(x = {: less accurate nanosecond times
   #> to avoid potential integer overflows
   #> Unit: milliseconds
   #>  expr      min       lq     mean   median       uq       max neval
   #>     x 41.99819 42.61737 46.73451 46.07767 46.77581 106.51480   100
   #>     y 41.97875 42.59027 46.16944 46.06932 46.63008  67.24804   100
   ```
   
   <sup>Created on 2023-11-30 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
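   
   If changing the option for the whole session is undesirable, it could be scoped to a single call. Below is a minimal sketch using only base R; `without_arrow_altrep()` is a hypothetical helper name, not part of arrow, and it assumes the option is read at the moment the Arrow data is converted to R vectors (i.e., inside `collect()`):
   
   ```r
   # Hypothetical helper (not part of arrow): evaluate `expr` with
   # arrow.use_altrep set to FALSE, then restore the previous value.
   without_arrow_altrep <- function(expr) {
     old <- options(arrow.use_altrep = FALSE)  # returns the previous setting
     on.exit(options(old), add = TRUE)         # restore on exit, even on error
     force(expr)                               # lazy argument is evaluated here
   }
   
   # The option is FALSE only while the wrapped expression runs
   inside <- without_arrow_altrep(getOption("arrow.use_altrep"))
   after <- getOption("arrow.use_altrep")
   ```
   
   Under that assumption, wrapping the read from the example above, as in `without_arrow_altrep(open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect())`, should return ordinary (non-ALTREP) vectors without affecting later calls.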
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
