Wes McKinney created ARROW-10569:
------------------------------------

             Summary: [C++][Python] Poor Table filtering performance
                 Key: ARROW-10569
                 URL: https://issues.apache.org/jira/browse/ARROW-10569
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
            Reporter: Wes McKinney
             Fix For: 3.0.0

From the mailing list

{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

num_rows = 10_000_000
data = np.random.randn(num_rows)

df = pd.DataFrame({'data{}'.format(i): data for i in range(100)})

df['key'] = np.random.randint(0, 100, size=num_rows)

rb = pa.record_batch(df)
t = pa.table(df)

I found that the performance of filtering a record batch is very similar:

In [22]: timeit df[df.key == 5]
71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Whereas the performance of filtering a table is absolutely abysmal (no idea
what's going on here)

In [23]: %timeit t.filter(pc.equal(t[-1], 5))
961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}

[https://lists.apache.org/thread.html/r4d4ffa7935efb2902600b9024859211e53aa6552d43ba0ad83517af5%40%3Cuser.arrow.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)