[GitHub] [arrow] pitrou opened a new pull request #8777: ARROW-10569: [C++] Improve table filtering performance

GitBox Thu, 26 Nov 2020 07:59:13 -0800


pitrou opened a new pull request #8777:
URL: https://github.com/apache/arrow/pull/8777



   Instead of applying the same boolean filter to all N columns, first convert 
the filter to take indices. 
   Selecting by index from a linear array is much faster, because it avoids 
bit-unpacking the boolean filter.
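
The idea can be sketched in plain NumPy. This is only an analogy and not the C++ change in this PR: the helper name `filter_columns_via_indices` and the small dataset are hypothetical, and NumPy masks are byte-per-value rather than bit-packed, so the sketch shows the "compute indices once, take on every column" structure rather than the bit-unpacking savings.

```python
import numpy as np


def filter_columns_via_indices(columns, mask):
    """Filter every column with one shared index array (hypothetical helper)."""
    # Convert the boolean filter to row indices a single time.
    indices = np.flatnonzero(mask)
    # Reuse those indices for a "take" on each column.
    return {name: values[indices] for name, values in columns.items()}


rng = np.random.default_rng(0)
num_rows = 1_000
columns = {"data{}".format(i): rng.standard_normal(num_rows) for i in range(3)}
columns["key"] = rng.integers(0, 100, size=num_rows)

mask = columns["key"] == 5
taken = filter_columns_via_indices(columns, mask)

# The result matches applying the boolean mask to each column separately.
for name, values in columns.items():
    assert np.array_equal(taken[name], values[mask])
```

Both paths select the same rows; the difference is that the mask is scanned once instead of once per column.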
   
   Using the Python benchmark setup from ARROW-10569:
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.compute as pc
   import numpy as np
   
   num_rows = 10_000_000
   data = np.random.randn(num_rows)
   
   df = pd.DataFrame({'data{}'.format(i): data
                      for i in range(100)})
   
   df['key'] = np.random.randint(0, 100, size=num_rows)
   
   rb = pa.record_batch(df)
   t = pa.table(df)
   ```
   
   * before:
   ```python
   >>> %timeit rb.filter(pc.equal(rb[-1], 5))
   60 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   >>> %timeit t.filter(pc.equal(t[-1], 5))
   1.22 s ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   ```
   * after:
   ```python
   >>> %timeit rb.filter(pc.equal(rb[-1], 5))
   59.2 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   >>> %timeit t.filter(pc.equal(t[-1], 5))
   59.3 ms ± 339 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
