jorisvandenbossche opened a new issue, #36059:
URL: https://github.com/apache/arrow/issues/36059
From some benchmarking in pandas, @phofl noticed that Arrow's "is_in"
implementation is considerably slower than pandas' (khash-based)
implementation _if_ the value_set is big (i.e. when the execution time is
dominated by building the hash table rather than by the actual lookups).
A small illustration comparing `pc.is_in` with `pandas.core.algorithms.isin`
(the equivalent pandas function), using a large value_set with a small array
(so that we are mostly timing the hash table creation):
```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(np.random.choice(np.arange(1000), 100))
for n in [10_000, 100_000, 1_000_000, 10_000_000]:
print(n)
value_set = pa.array(np.arange(n))
%timeit pc.is_in(arr, value_set)
%timeit pd.core.algorithms.isin(np.asarray(arr), np.asarray(value_set))
```
gives
```
10000
158 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
66.8 µs ± 473 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
100000
8.51 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
627 µs ± 41.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
1000000
103 ms ± 8.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
28.5 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
10000000
1.26 s ± 33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
171 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
So pandas is roughly 2-14x faster here, with the largest gap for the
mid-sized value_sets. I am not familiar with our HashTable implementation,
or whether this is a fully fair comparison, but it seems at least an
indication that there is room for performance improvement (and I suppose
this might apply to HashTable-based functions in general, not just "is_in").
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]