[
https://issues.apache.org/jira/browse/ARROW-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129047#comment-17129047
]
Kevin Glasson commented on ARROW-8888:
--------------------------------------
Yeah - so off the top of my head it was about 6 million rows and 40 columns,
mostly string objects, mixed in with some timestamps.
When I was profiling it I could see it was spending nearly all of its time in
'threading'.
The reduction was roughly 10x: instead of a write taking 50 minutes, it took 4.
There could be some other inefficiencies in my code of course, but just changing
that one flag gave me that massive reduction.
I can try and share the profiling if I get around to running it again.
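In the meantime, here is a minimal sketch of how the profiling described above could be reproduced with the standard-library cProfile. The helper names profile_call and compare_from_pandas are hypothetical and not part of pyarrow, and compare_from_pandas assumes pandas and pyarrow are installed. Note that cProfile only records the calling thread, so time spent waiting on worker threads would likely show up under 'threading' rather than under the conversion itself.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)  # keep only the top entries
    return result, buffer.getvalue()


def compare_from_pandas(df, top=10):
    """Profile Table.from_pandas with the default thread heuristic and with nthreads=1."""
    import pyarrow as pa  # imported lazily so profile_call stays stdlib-only

    reports = {}
    for label, nthreads in (("default", None), ("nthreads=1", 1)):
        _, reports[label] = profile_call(
            pa.Table.from_pandas, df, nthreads=nthreads, top=top
        )
    return reports
```

Calling compare_from_pandas(df) on the real DataFrame and printing the two reports side by side should show whether the default path is dominated by entries from threading.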
> [Python] Heuristic in dataframe_to_arrays that decides to multithread convert
> cause slow conversions
> ----------------------------------------------------------------------------------------------------
>
> Key: ARROW-8888
> URL: https://issues.apache.org/jira/browse/ARROW-8888
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.16.0
> Environment: MacOS: 10.15.4 (Also happening on windows 10)
> Python: 3.7.3
> Pyarrow: 0.16.0
> Pandas: 0.25.3
> Reporter: Kevin Glasson
> Priority: Minor
>
> When calling pa.Table.from_pandas(), the code path that uses the
> ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes
> the conversion much, much slower.
>
> I have a simple example - but the time difference is much worse with a real
> table.
>
> {code:python}
> Python 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:54:18)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 7.13.0 – An enhanced Interactive Python. Type '?' for help.
> In [1]: import pyarrow as pa
> In [2]: import pandas as pd
> In [3]: df = pd.DataFrame({"A": [0] * 10000000})
> In [4]: %timeit table = pa.Table.from_pandas(df)
> 577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
> 106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> {code}
>