paleolimbot commented on issue #39004:
URL: https://github.com/apache/arrow/issues/39004#issuecomment-1875615355

   I check to make sure that nothing unepxected is happening (e.g., we had an 
issue before where we were materializing the entire array by accident for each 
call to `Elt()`), and nothing seems to be amiss: the underlying implementation 
is calling `ISNAN(REAL_ELT(x))` (or similar, I didn't check) a lot of times for 
ALTREP objects. For us, that's very slow.
   
   A better implementation might call `REAL_GET_REGION()`. If it did, the 
ALTREP implementation would be slower but not nearly as bad as extracting each 
element individually.
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   x <- runif(29500000) * 10
   x_altrep <- as.vector(as_chunked_array(x))
   .Internal(inspect(x_altrep))
   #> @1064c4180 14 REALSXP g0c0 [REF(65535)] 
arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000
   
   # Probably a better implementation than base R's
   cpp11::cpp_function("
   cpp11::logicals is_na2(cpp11::doubles x) {
       int region_size = 1024;
       R_xlen_t n = x.size();
       cpp11::writable::logicals out(n);
       cpp11::writable::doubles buf_shelter(region_size);
       double* buf = REAL(buf_shelter);
       for (R_xlen_t i = 0; i < n; i++) {
         if ((i % region_size) == 0) {
           REAL_GET_REGION(x, i, region_size, buf);
         }
         out[i] = ISNAN(buf[i % region_size]);
       }
       return out;
   }                    
   ")
   
   bench::mark(
     is.na(x),
     is.na(x_altrep),
     is_na2(x),
     is_na2(x_altrep)
   )
   #> # A tibble: 4 × 6
   #>   expression            min   median `itr/sec` mem_alloc `gc/sec`
   #>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
   #> 1 is.na(x)           23.8ms   24.1ms     40.6      113MB    14.5 
   #> 2 is.na(x_altrep)   315.7ms  317.6ms      3.15     113MB     0   
   #> 3 is_na2(x)          60.8ms   61.4ms     16.3      113MB     5.44
   #> 4 is_na2(x_altrep)     62ms   62.3ms     16.0      113MB     5.35
   
   # Make sure we didn't materialize
   .Internal(inspect(x_altrep))
   #> @1064c4180 14 REALSXP g1c0 [MARK,REF(65535)] 
arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000
   ```
   
   <sup>Created on 2024-01-03 with [reprex 
v2.0.2](https://reprex.tidyverse.org)</sup>
   
   It does beg the question of whether ALTREP by default is worth the trouble.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to