paleolimbot commented on issue #39004:
URL: https://github.com/apache/arrow/issues/39004#issuecomment-1834306130
Thanks for opening the issue, and thanks for the reprex!
It is true that ALTREP objects generally perform more slowly than non-ALTREP
objects, although I wouldn't have expected this particular operation to be that
much slower.
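
For context, the difference between an ALTREP vector and a materialized copy can be seen in base R alone. The sketch below is only an illustration. It uses R's built-in compact integer sequence as a stand-in for an Arrow-backed vector (an assumption; Arrow's own ALTREP classes are not exercised here), and it relies on the internal, unsupported `.Internal(inspect())` helper to print each object's representation.

```r
# Base R only; a compact integer sequence stands in for an Arrow ALTREP vector.
x_altrep <- 1:10            # stored as an ALTREP "compact" sequence
y_plain  <- x_altrep + 0L   # arithmetic allocates an ordinary integer vector

# The two objects compare as identical even though their storage differs.
same_values <- identical(x_altrep, y_plain)

# .Internal(inspect()) is an internal debugging aid, not a stable API;
# capture.output() collects what it prints so the text can be examined.
repr_x <- capture.output(.Internal(inspect(x_altrep)))
repr_y <- capture.output(.Internal(inspect(y_plain)))

print(repr_x)
print(repr_y)
```

This mirrors the `y = x + 0` step in the reprex: the arithmetic yields a regular vector with the same values, which is why `x` and `y` can be compared as ALTREP versus non-ALTREP.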
I will dig into this, but in the meantime, you can turn ALTREP off using
`options(arrow.use_altrep = FALSE)`:
``` r
library(arrow)
library(dplyr)
options(arrow.use_altrep = FALSE)
# generate data
x = runif(29500000) * 10
d = data.frame(cv = x)
write_dataset(d, "/tmp/data.arrow")
# then read back
df = open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
x = df$cv
y = x + 0
identical(x, y)
#> [1] TRUE
microbenchmark::microbenchmark(x={sum(is.na(x))}, y={sum(is.na(y))})
#> Warning in microbenchmark::microbenchmark(x = {: less accurate nanosecond times
#> to avoid potential integer overflows
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq       max neval
#>     x 41.99819 42.61737 46.73451 46.07767 46.77581 106.51480   100
#>     y 41.97875 42.59027 46.16944 46.06932 46.63008  67.24804   100
```
<sup>Created on 2023-11-30 with [reprex
v2.0.2](https://reprex.tidyverse.org)</sup>
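
If changing the option for the whole session is undesirable, the workaround can be scoped to a single call. The following is a minimal sketch, not part of arrow: `with_altrep_disabled()` is a hypothetical helper name, and it assumes the `arrow.use_altrep` option is read at the moment the data is converted to R vectors (for example during `collect()`).

```r
# Hypothetical helper (not part of arrow): evaluate `expr` with
# arrow.use_altrep set to FALSE, then restore the previous setting.
with_altrep_disabled <- function(expr) {
  old <- options(arrow.use_altrep = FALSE)  # returns the previous value
  on.exit(options(old), add = TRUE)         # restore on exit, even on error
  # `expr` is a lazily evaluated promise, so it is evaluated here,
  # after the option has been changed.
  force(expr)
}

# Usage sketch (requires arrow and dplyr, as in the reprex above):
# df <- with_altrep_disabled(
#   open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
# )
```

Because the previous value is restored by `on.exit()`, code outside the call keeps whatever ALTREP setting was already in effect.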