[GitHub] [arrow] nealrichardson commented on pull request #7668: ARROW-6982: [R] Add bindings for compare and boolean kernels

GitBox Wed, 22 Jul 2020 10:37:35 -0700


nealrichardson commented on pull request #7668:
URL: https://github.com/apache/arrow/pull/7668#issuecomment-662587689



   Here's an informal benchmark that shows the benefit of pushing all of this 
work down into Arrow. The "old" expression reproduces what current master does: 
it calls `as.vector` on every Array before doing any comparisons or 
aggregations. Doing the comparison, filtering, and aggregation in Arrow is 
4-5x faster (median 47.7ms vs. 213.8ms) and allocates far less memory in R 
(10.2KB vs. 327.6MB), even with eager evaluation:
   
   ```r
   library(arrow)
   tab <- read_parquet("nyc-taxi/2019/06/data.parquet", as_data_frame = FALSE)
   dim(tab)
   ## [1] 6941024      18
   
   bench::mark(
     new = as.vector(mean(
       tab$fare_amount[tab$trip_distance > 1 & tab$passenger_count < 4],
       na.rm = TRUE
     )),
     old = mean(
       as.vector(tab$fare_amount[as.vector(tab$trip_distance) > 1 & as.vector(tab$passenger_count) < 4]),
       na.rm = TRUE
     )
   )
   ## # A tibble: 2 x 13
   ##   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
   ##   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
   ## 1 new         47.4ms  47.7ms     17.6     10.2KB     1.95     9     1      512ms
   ## 2 old        207.3ms 213.8ms      4.70   327.6MB    12.5      3     8      638ms
   ## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
   ```
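
For anyone without the taxi file, here is a rough, self-contained sketch that checks the two expressions agree on a small synthetic table. The column names mirror the benchmark above, but the data are made up (random values with a fixed seed), so this only illustrates that both paths compute the same mean. It says nothing about the timings, which will differ by data size and machine.

```r
library(arrow)

# Synthetic stand-in for the taxi data: hypothetical random values,
# NOT the real nyc-taxi file used in the benchmark above.
set.seed(42)
n <- 1e5
df <- data.frame(
  fare_amount     = runif(n, 2.5, 80),
  trip_distance   = runif(n, 0, 20),
  passenger_count = sample(1:6, n, replace = TRUE)
)
tab <- Table$create(df)

# "new": comparison, boolean AND, filter, and mean all operate on Arrow data;
# only the final scalar is pulled into R.
new <- as.vector(mean(
  tab$fare_amount[tab$trip_distance > 1 & tab$passenger_count < 4],
  na.rm = TRUE
))

# "old": every column is converted to an R vector before comparing,
# and the filtered column is converted again before aggregating.
old <- mean(
  as.vector(tab$fare_amount[as.vector(tab$trip_distance) > 1 & as.vector(tab$passenger_count) < 4]),
  na.rm = TRUE
)

# The two results should match up to floating-point tolerance.
all.equal(new, old)
```

Wrapping the `new` and `old` expressions in `bench::mark()` as above would give a timing comparison on the synthetic table, though at this size the gap may look quite different from the 6.9M-row case.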


