[ 
https://issues.apache.org/jira/browse/ARROW-10569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10569:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++][Python] Poor Table filtering performance
> ----------------------------------------------
>
>                 Key: ARROW-10569
>                 URL: https://issues.apache.org/jira/browse/ARROW-10569
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> From the mailing list
>  
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.compute as pc
> import numpy as np
> num_rows = 10_000_000
> data = np.random.randn(num_rows)
> df = pd.DataFrame({'data{}'.format(i): data
>                    for i in range(100)})
> df['key'] = np.random.randint(0, 100, size=num_rows)
> rb = pa.record_batch(df)
> t = pa.table(df)
> # I found that filtering a record batch performs very similarly to
> # filtering the pandas DataFrame:
> In [22]: %timeit df[df.key == 5]
> 71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
> In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
> 75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> # Whereas the performance of filtering a table is absolutely abysmal (no
> # idea what's going on here):
> In [23]: %timeit t.filter(pc.equal(t[-1], 5))
> 961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>  {code}
>  
> [https://lists.apache.org/thread.html/r4d4ffa7935efb2902600b9024859211e53aa6552d43ba0ad83517af5%40%3Cuser.arrow.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
