paleolimbot commented on issue #39004:
URL: https://github.com/apache/arrow/issues/39004#issuecomment-1875615355
I check to make sure that nothing unepxected is happening (e.g., we had an
issue before where we were materializing the entire array by accident for each
call to `Elt()`), and nothing seems to be amiss: the underlying implementation
is calling `ISNAN(REAL_ELT(x))` (or similar, I didn't check) a lot of times for
ALTREP objects. For us, that's very slow.
A better implementation might call `REAL_GET_REGION()`. If it did, the
ALTREP implementation would be slower but not nearly as bad as extracting each
element individually.
``` r
library(arrow, warn.conflicts = FALSE)
x <- runif(29500000) * 10
x_altrep <- as.vector(as_chunked_array(x))
.Internal(inspect(x_altrep))
#> @1064c4180 14 REALSXP g0c0 [REF(65535)]
arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000
# Probably a better implementation than base R's
cpp11::cpp_function("
cpp11::logicals is_na2(cpp11::doubles x) {
int region_size = 1024;
R_xlen_t n = x.size();
cpp11::writable::logicals out(n);
cpp11::writable::doubles buf_shelter(region_size);
double* buf = REAL(buf_shelter);
for (R_xlen_t i = 0; i < n; i++) {
if ((i % region_size) == 0) {
REAL_GET_REGION(x, i, region_size, buf);
}
out[i] = ISNAN(buf[i % region_size]);
}
return out;
}
")
bench::mark(
is.na(x),
is.na(x_altrep),
is_na2(x),
is_na2(x_altrep)
)
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 is.na(x) 23.8ms 24.1ms 40.6 113MB 14.5
#> 2 is.na(x_altrep) 315.7ms 317.6ms 3.15 113MB 0
#> 3 is_na2(x) 60.8ms 61.4ms 16.3 113MB 5.44
#> 4 is_na2(x_altrep) 62ms 62.3ms 16.0 113MB 5.35
# Make sure we didn't materialize
.Internal(inspect(x_altrep))
#> @1064c4180 14 REALSXP g1c0 [MARK,REF(65535)]
arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000
```
<sup>Created on 2024-01-03 with [reprex
v2.0.2](https://reprex.tidyverse.org)</sup>
It does beg the question of whether ALTREP by default is worth the trouble.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]