nealrichardson commented on issue #43659:
URL: https://github.com/apache/arrow/issues/43659#issuecomment-2284656634
FWIW, I'm not seeing this, at least on this query using a smaller sample of nyc_taxi:
```
bench::mark(
  old = nyc_taxi |> filter(total_amount > 100) |> nrow(),
  new = nyc_taxi |> summarize(sum(total_amount > 100)) |> collect() |> pull()
)
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 old          16.3ms   17.3ms      57.1    69.1KB     11.9    24     5      421ms
2 new          20.4ms   21.3ms      46.5   129.5KB     16.4    17     6      366ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
```
That said, the first time I did it, it *was* slower, but on subsequent tries it was faster. Sounds like disk caching or something?
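
One way to check would be to look at the individual run times instead of the summary: if only the first iteration is slow and the rest are fast, that points at cold vs. warm disk reads rather than the query itself. A minimal sketch, assuming a local copy of the dataset at a hypothetical path `nyc-taxi`:
```
library(arrow)
library(dplyr)

# Hypothetical path: point this at wherever your copy of the dataset lives
nyc_taxi <- open_dataset("nyc-taxi")

# Run the query a fixed number of times and inspect per-iteration timings.
# If only the first run is slow, the difference is probably the OS page
# cache (cold vs. warm reads), not the query plan.
res <- bench::mark(
  new = nyc_taxi |> summarize(sum(total_amount > 100)) |> collect() |> pull(),
  iterations = 10
)
res$time[[1]]  # individual run times; check whether run 1 is the outlier
```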