[ 
https://issues.apache.org/jira/browse/ARROW-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129030#comment-17129030
 ] 

Joris Van den Bossche commented on ARROW-8888:
----------------------------------------------

When the user does not specify it, a heuristic based on the number of rows 
versus the number of columns currently decides whether to use multithreading 
(https://github.com/apache/arrow/blob/d00c50a6ca0d88e3458742091c59f0fc5c2fc7de/python/pyarrow/pandas_compat.py#L541-L549).

This will probably not be the best decision in all cases. In your example 
there is only a single column, and the parallelization is done by processing 
each column in its own thread, so with a single column the thread pool only 
adds unnecessary overhead. Also, ints are very cheap (zero-copy) to convert. 
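
To illustrate why a single column cannot benefit, here is a minimal sketch of 
per-column parallelization. This is not the pyarrow implementation; the 
function name convert_columns and its parameters are hypothetical, and 
convert stands in for whatever per-column conversion is being done.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_columns(columns, convert, nthreads=1):
    """Apply `convert` to each column, optionally in a thread pool.

    Hypothetical helper for illustration only, not pyarrow's code.
    """
    if nthreads == 1 or len(columns) <= 1:
        # The unit of work is a whole column, so a single column cannot be
        # split across threads; a pool would only add scheduling overhead.
        return [convert(col) for col in columns]
    with ThreadPoolExecutor(max_workers=nthreads) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(convert, columns))
```

With one column the pool is skipped entirely; with many columns each column 
becomes one task for the pool.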

So I suspect that with more columns and with a more expensive conversion, you 
will see the benefit of the default multithreading. For example using floats 
instead of ints and more columns:

{code}
In [1]: df = pd.DataFrame({key: [0.0] * 1_000_000 for key in range(100)})

In [2]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
1.3 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit table = pa.Table.from_pandas(df, nthreads=None)
327 ms ± 5.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}

(With this number of columns but only ints, using no threads is still 
faster.)

So you are certainly welcome to look into the default heuristic that decides 
how many threads are used, and whether it can be improved (e.g. by ensuring 
that we do not use more threads than there are columns), but I think it will 
never be ideal for all possible use cases.
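
As a rough sketch of that suggestion: the function name choose_nthreads and 
the row_col_ratio threshold below are hypothetical and not taken from pyarrow, 
and os.cpu_count() stands in for whatever thread count the library would 
otherwise use. It only shows how a rows-versus-columns check could be combined 
with a cap on the number of columns.

```python
import os


def choose_nthreads(nrows, ncols, nthreads=None, row_col_ratio=100):
    """Pick a thread count for a column-wise conversion.

    Hypothetical helper; `row_col_ratio` is an assumed threshold, not a
    value confirmed from the pyarrow source.
    """
    if nthreads is not None:
        # An explicit user choice always wins over the heuristic.
        return nthreads
    if ncols <= 1 or nrows <= ncols * row_col_ratio:
        # A single column, or too few rows per column: stay single-threaded.
        return 1
    # Never use more threads than there are columns to process.
    return min(os.cpu_count() or 1, ncols)
```

Note that this still ignores the column dtype, so the cheap int case with 
many columns would keep using threads under such a rule.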


> [Python] Heuristic in dataframe_to_arrays that decides to multithread convert 
> cause slow conversions
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8888
>                 URL: https://issues.apache.org/jira/browse/ARROW-8888
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0
>         Environment: MacOS: 10.15.4 (Also happening on windows 10)
> Python: 3.7.3
> Pyarrow: 0.16.0
> Pandas: 0.25.3
>            Reporter: Kevin Glasson
>            Priority: Minor
>
> When calling pa.Table.from_pandas(), the code path that uses the 
> ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes 
> the conversion much slower.
>  
>  I have a simple example - but the time difference is much worse with a real 
> table.
>  
> {code:java}
> Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
>  Type 'copyright', 'credits' or 'license' for more information
>  IPython 7.13.0 – An enhanced Interactive Python. Type '?' for help.
> In [1]: import pyarrow as pa
> In [2]: import pandas as pd
> In [3]: df = pd.DataFrame({"A": [0] * 10000000})
> In [4]: %timeit table = pa.Table.from_pandas(df)
>  577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
>  106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
