nealrichardson commented on issue #43659:
URL: https://github.com/apache/arrow/issues/43659#issuecomment-2284656634
FWIW, I'm not seeing this, at least on this query using a smaller sample of nyc_taxi:
```
bench::mark(
  old = nyc_taxi |> filter(total_amount > 100) |> nrow(),
  new = nyc_taxi |> summarize(sum(total_amount > 100)) |> collect() |> pull()
)
# A tibble: 2 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 old          16.3ms   17.3ms      57.1    69.1KB     11.9    24     5      421ms
2 new          20.4ms   21.3ms      46.5   129.5KB     16.4    17     6      366ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
```
That said, the first time I did it, it *was* slower, but on subsequent tries it was faster. Sounds like disk caching or something?
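
One way to check would be to look at the individual run times instead of the summary: if only the first iteration is slow and the rest are fast, that points at cold vs. warm disk reads rather than the query itself. A minimal sketch, assuming a local copy of the dataset at a hypothetical path `nyc-taxi`:
```
library(arrow)
library(dplyr)

# Hypothetical path: point this at wherever your copy of the dataset lives
nyc_taxi <- open_dataset("nyc-taxi")

# Run the query a fixed number of times and inspect per-iteration timings.
# If only the first run is slow, the difference is probably the OS page
# cache (cold vs. warm reads), not the query plan.
res <- bench::mark(
  new = nyc_taxi |> summarize(sum(total_amount > 100)) |> collect() |> pull(),
  iterations = 10
)
res$time[[1]]  # individual run times; check whether run 1 is the outlier
```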