Paul Balanca created ARROW-11006:
------------------------------------
Summary: [Python] Array to_numpy slow compared to Numpy.view
Key: ARROW-11006
URL: https://issues.apache.org/jira/browse/ARROW-11006
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Paul Balanca
Assignee: Paul Balanca
The `to_numpy` method is quite slow compared to NumPy slicing and view
creation. For instance:
{code:python}
import numpy as np
import pyarrow as pa
N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)
%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The benchmark above is clearly an extreme case, but the point is that for any
operation not available in PyArrow, falling back to NumPy is a good option, and
the cost of extracting the NumPy view should be as small as possible (there are
scenarios where you cannot easily cache this view, so you end up calling
`to_numpy` a fair number of times).
I suspect that part of this overhead comes from PyArrow using a very generic
Pandas conversion path, even for very simple NumPy-like dense arrays.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)