nealrichardson commented on pull request #7668:
URL: https://github.com/apache/arrow/pull/7668#issuecomment-662587689
Here's an informal benchmark that shows the benefit of pushing all of this
work down into Arrow. The `old` expression reproduces the current behavior on
master, which calls `as.vector` on every Array before doing any comparisons or
aggregations. Doing the comparison, filtering, and aggregation in Arrow is
4-5x faster and allocates far less memory in R, even with eager evaluation:
```r
library(arrow)
tab <- read_parquet("nyc-taxi/2019/06/data.parquet", as_data_frame = FALSE)
dim(tab)
## [1] 6941024 18
bench::mark(
  new = as.vector(mean(tab$fare_amount[tab$trip_distance > 1 &
    tab$passenger_count < 4], na.rm = TRUE)),
  old = mean(as.vector(tab$fare_amount[as.vector(tab$trip_distance) > 1 &
    as.vector(tab$passenger_count) < 4]), na.rm = TRUE)
)
## # A tibble: 2 x 13
##   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
##   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
## 1 new         47.4ms  47.7ms     17.6     10.2KB     1.95     9     1      512ms
## 2 old        207.3ms 213.8ms      4.70   327.6MB    12.5      3     8      638ms
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
```
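For anyone without the taxi file, the same comparison can be sketched on a
tiny in-memory table. This is only an illustration: the data frame `df` and
its values are made up, and it assumes an arrow build in which comparison,
`&`, filtering with `[`, and `mean()` work on ChunkedArrays, as in the
benchmark above.

```r
library(arrow)

# Hypothetical toy data; column names mirror the benchmark, values are made up.
df <- data.frame(
  fare_amount     = c(5, 10, 20, NA),
  trip_distance   = c(0.5, 2, 3, 4),
  passenger_count = c(1L, 2L, 5L, 1L)
)
tab <- Table$create(df)

# "new" path: comparison, filter, and aggregation all happen in Arrow;
# only the final scalar is converted to an R value.
new <- as.vector(mean(
  tab$fare_amount[tab$trip_distance > 1 & tab$passenger_count < 4],
  na.rm = TRUE
))

# Reference result computed entirely in base R on the original data frame.
ref <- mean(
  df$fare_amount[df$trip_distance > 1 & df$passenger_count < 4],
  na.rm = TRUE
)

# Both paths should agree on the filtered mean.
stopifnot(isTRUE(all.equal(new, ref)))
```

`bench::mark()` performs a similar equality check on its expressions by
default, so the timings above already compare equivalent results.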