jorisvandenbossche opened a new issue, #36059:
URL: https://github.com/apache/arrow/issues/36059

   From some benchmarking in pandas, @phofl noticed that the "is_in" 
implementation of Arrow is considerably slower compared to pandas' 
(khash-based) implementation, _if_ the value_set is big (and thus when the 
execution time is mostly dominated by creating the hashtable, and not by the 
actual lookups). 
   
   
   Small illustration comparing `pc.is_in` with `pandas.core.algorithms.isin` 
(the equivalent pandas function) using a larger value_set with a small array 
(so that we are mostly timing the hashtable creation):
   ```
   arr = pa.array(np.random.choice(np.arange(1000), 100))
   for n in [10_000, 100_000, 1_000_000, 10_000_000]:
       print(n)
       value_set = pa.array(np.arange(n))
       %timeit pc.is_in(arr, value_set)
       %timeit pd.core.algorithms.isin(np.asarray(arr), np.asarray(value_set))
   ```
   
   gives
   
   ```
   10000
   158 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   66.8 µs ± 473 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
   100000
   8.51 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   627 µs ± 41.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
   1000000
   103 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   28.5 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   10000000
   1.26 s ± 33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   171 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```
   
   So pandas is often 4-10x faster. I am not familiar with our HashTable 
implementation or whether it's a fully fair comparison, but it seems this is at 
least an indication there is room for performance improvement (and I suppose 
this might be for HashTable-based functions in general, not just "is_in").


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to