Joris Van den Bossche created ARROW-10423:
---------------------------------------------
Summary: [C++] Filter compute function seems slow compared to
numpy nonzero + take
Key: ARROW-10423
URL: https://issues.apache.org/jira/browse/ARROW-10423
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
From
https://stackoverflow.com/questions/64581590/is-there-a-more-efficient-way-to-select-rows-from-a-pyarrow-table-based-on-conte
I made a smaller, simplified example:
{code:python}
import numpy as np
import pyarrow as pa

arr = pa.array(np.random.randn(1_000_000))

# mask with only a few True values
mask1 = np.zeros(len(arr), dtype=bool)
mask1[np.random.randint(len(arr), size=100)] = True
mask1_pa = pa.array(mask1)

# mask with a larger proportion of True values
mask2 = np.zeros(len(arr), dtype=bool)
mask2[np.random.randint(len(arr), size=10_000)] = True
mask2_pa = pa.array(mask2)
{code}
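As a quick sanity check (my addition, not part of the original report), the two code paths compared below should produce identical results, since {{np.nonzero}} returns the selected indices in ascending order and {{Filter}} preserves the original order:

{code:python}
# Minimal sketch: verify filter(mask) and take(nonzero(mask)) select
# the same values before comparing their timings.
result_filter = arr.filter(mask2_pa)
result_take = arr.take(np.nonzero(mask2)[0])
assert result_filter.equals(result_take)
{code}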
Timing the Arrow {{Filter}} kernel against using numpy to convert the mask
into indices and then calling the {{Take}} kernel:
{code}
# mask 1
In [3]: %timeit arr.filter(mask1_pa)
132 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [4]: %%timeit
...: indices = np.nonzero(mask1)[0]
...: arr.take(indices)
114 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# mask 2
In [8]: %timeit arr.filter(mask2_pa)
711 µs ± 63.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %%timeit
...: indices = np.nonzero(mask2)[0]
...: arr.take(indices)
333 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
{code}
So in the first case both approaches are quite similar in timing, but in the
second case the numpy+take version is clearly faster.
I know this depends a lot on the actual proportion of True values and on how
they are positioned in the array (random vs concentrated), so there is
probably no general rule for which approach should be faster.
Still, it seems a potential indication that the {{Filter}} kernel can be
optimized.