[jira] [Created] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view

Paul Balanca (Jira) Tue, 22 Dec 2020 07:59:34 -0800

Paul Balanca created ARROW-11006:
------------------------------------

             Summary: [Python] Array to_numpy slow compared to Numpy.view
                 Key: ARROW-11006
                 URL: https://issues.apache.org/jira/browse/ARROW-11006
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
            Reporter: Paul Balanca
            Assignee: Paul Balanca



The method `to_numpy` is quite slow compared to NumPy slicing and viewing 
performance. For instance:
{code:python}
import numpy as np
import pyarrow as pa

N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The previous benchmark is clearly an extreme case, but the idea is that for any 
operation not available in PyArrow, falling back on NumPy is a good option, and 
the cost of extracting the view should be as small as possible (there are 
scenarios where you can't easily cache this view, so you end up calling 
`to_numpy` many times).

I suspect that part of this overhead comes from PyArrow implementing a very 
generic Pandas conversion and using it even for very simple NumPy-like dense 
arrays.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
