pitrou opened a new pull request #8777:
URL: https://github.com/apache/arrow/pull/8777
Instead of applying the same boolean filter to all N columns, first convert
the filter to take indices.
Selecting indices of a linear array is much faster, thanks to avoiding
bit-unpacking on the boolean filter.
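
As a rough illustration of the idea (a pure-Python sketch, not the actual C++ change in this PR), the bit-packed mask can be unpacked into row indices once and those indices reused for every column. The helper names `mask_to_indices`, `take` and `filter_columns` are hypothetical and exist only for this example; the mask is assumed to be packed least-significant bit first.

```python
def mask_to_indices(packed_mask: bytes, length: int) -> list:
    """Unpack a bit-packed boolean mask (LSB first) into selected row indices."""
    return [i for i in range(length) if (packed_mask[i // 8] >> (i % 8)) & 1]


def take(column: list, indices: list) -> list:
    """Gather the values of `column` at the given positions."""
    return [column[i] for i in indices]


def filter_columns(columns: dict, packed_mask: bytes, length: int) -> dict:
    # Pay the bit-unpacking cost once, then reuse the indices for each column.
    indices = mask_to_indices(packed_mask, length)
    return {name: take(values, indices) for name, values in columns.items()}


columns = {"a": [10, 11, 12, 13, 14], "key": [5, 0, 5, 1, 5]}
# Bits 0, 2 and 4 are set, so rows 0, 2 and 4 are selected.
packed_mask = bytes([0b00010101])
filtered = filter_columns(columns, packed_mask, length=5)
```

With many columns, the unpacking in `mask_to_indices` runs once rather than once per column, which is the saving described above.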
Using the Python benchmark setup from ARROW-10569:
```python
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np
num_rows = 10_000_000
data = np.random.randn(num_rows)
df = pd.DataFrame({'data{}'.format(i): data
                   for i in range(100)})
df['key'] = np.random.randint(0, 100, size=num_rows)
rb = pa.record_batch(df)
t = pa.table(df)
```
* before:
```python
>>> %timeit rb.filter(pc.equal(rb[-1], 5))
60 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit t.filter(pc.equal(t[-1], 5))
1.22 s ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
* after:
```python
>>> %timeit rb.filter(pc.equal(rb[-1], 5))
59.2 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit t.filter(pc.equal(t[-1], 5))
59.3 ms ± 339 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```