[jira] [Commented] (ARROW-8888) [Python] Heuristic in dataframe_to_arrays that decides to multithread convert cause slow conversions

Joris Van den Bossche (Jira) Tue, 09 Jun 2020 02:40:18 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129062#comment-17129062
 ]


Joris Van den Bossche commented on ARROW-8888:
----------------------------------------------

Trying with strings, I don't see much speedup in that case by using 
multithreading, but also not a significant slowdown:

{code}
In [10]: df = pd.DataFrame({key: ['a'] * 1_000_000 for key in range(10)})

In [11]: %timeit table = pa.Table.from_pandas(df)
3.43 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
3.79 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}

Strings are stored in pandas as Python objects. I am not sure exactly how the 
conversion is implemented in Arrow, but it may require the GIL, in which case 
multithreading won't help.
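
To illustrate the general effect (this is not pyarrow's conversion code, just 
a stdlib-only sketch): on a standard CPython build with the GIL, threads that 
run pure-Python work over Python objects take turns rather than run in 
parallel. The function count_chars below is a made-up stand-in for per-column 
work on object-dtype data, and the column sizes are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_chars(strings):
    # Pure-Python loop over str objects; on a GIL build of CPython only one
    # thread can execute this bytecode at a time.
    total = 0
    for s in strings:
        total += len(s)
    return total


# Hypothetical stand-in for several object-dtype string columns.
columns = [["a"] * 200_000 for _ in range(8)]

# Process the columns one after another.
start = time.perf_counter()
serial = [count_chars(col) for col in columns]
serial_time = time.perf_counter() - start

# Process the same columns through a thread pool.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(count_chars, columns))
threaded_time = time.perf_counter() - start

print(f"serial:   {serial_time:.3f} s")
print(f"threaded: {threaded_time:.3f} s")
```

With the GIL in place I would expect the two timings to be in the same 
ballpark, similar to the from_pandas numbers above, but the exact figures 
depend on the machine and the Python build.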

The large slowdown you mention is still strange, though, so if you can look 
into it further, that would certainly be welcome.
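
For that kind of investigation, a small helper along these lines could be run 
against the real table. The names best_time and compare_nthreads are made up 
for this sketch; the only pyarrow call it relies on is 
pa.Table.from_pandas(df, nthreads=...), the same call used in the timings 
above. pyarrow is imported lazily so the timing helper can also be used on its 
own.

```python
import timeit


def best_time(func, repeat=5):
    # Best wall-clock time in seconds over `repeat` single runs of func().
    return min(timeit.repeat(func, number=1, repeat=repeat))


def compare_nthreads(df, thread_counts=(None, 1), repeat=5):
    # Time Table.from_pandas for each nthreads value.
    # nthreads=None leaves the threading decision to pyarrow's default
    # behaviour; nthreads=1 forces a single-threaded conversion.
    import pyarrow as pa  # lazy import; assumes pyarrow is installed

    results = {}
    for n in thread_counts:
        results[n] = best_time(
            lambda n=n: pa.Table.from_pandas(df, nthreads=n), repeat=repeat
        )
    return results
```

Calling compare_nthreads(df) on the problematic DataFrame returns a dict that 
maps each nthreads value to its best time, which makes it easy to see on which 
tables the default is slower than nthreads=1.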

> [Python] Heuristic in dataframe_to_arrays that decides to multithread convert 
> cause slow conversions
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8888
>                 URL: https://issues.apache.org/jira/browse/ARROW-8888
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: MacOS: 10.15.4 (Also happening on windows 10)
> Python: 3.7.3
> Pyarrow: 0.16.0
> Pandas: 0.25.3
>            Reporter: Kevin Glasson
>            Priority: Minor
>
> When calling pa.Table.from_pandas(), the conversion is much, much slower on 
> the code path that uses the ThreadPoolExecutor in dataframe_to_arrays 
> (called by Table.from_pandas).
>  
>  I have a simple example - but the time difference is much worse with a real 
> table.
>  
> {code:java}
> Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
>  Type 'copyright', 'credits' or 'license' for more information
>  IPython 7.13.0 – An enhanced Interactive Python. Type '?' for help.
> In [1]: import pyarrow as pa
> In [2]: import pandas as pd
> In [3]: df = pd.DataFrame({"A": [0] * 10000000})
> In [4]: %timeit table = pa.Table.from_pandas(df)
>  577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
>  106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
