[ 
https://issues.apache.org/jira/browse/ARROW-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432406#comment-17432406
 ] 

Joris Van den Bossche commented on ARROW-12976:
-----------------------------------------------

[~emkornfield] this is specifically about the scalar {{as_py()}} and array 
{{to_pylist()}} behaviour, right? (and not {{Table.to_pandas()}})

Personally I would be fine with a more explicit API (the idea would be to add a 
keyword to those functions to explicitly ask for a pandas object?). But some 
concerns:

1) Changing the default for ns resolution from {{pd.Timestamp}} to 
{{datetime.datetime}} would be a backwards incompatible change. How do we want 
to handle that? Just change it, or deprecate first? (Deprecating seems a bit 
annoying, although if we add a keyword, it can be used directly to silence the 
warning, and I suppose those functions are not used that much anyway.)

2) If the timestamp value has a non-zero nanosecond component, that means we 
would raise an error by default? (the one we raise now if pandas is not 
installed) That doesn't feel like a very nice user experience, but I suppose 
it is the inevitable consequence of a more explicit API.
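
To make point 2 concrete, here is a minimal pure-Python sketch of what the more explicit behaviour could look like. The function name {{ns_timestamp_as_py}} and the keyword {{use_pandas}} are hypothetical and only for illustration; they are not existing pyarrow API, and the real keyword name and error message would still need to be decided.

```python
import datetime

_EPOCH = datetime.datetime(1970, 1, 1)


def ns_timestamp_as_py(value_ns, use_pandas=False):
    """Convert nanoseconds since the epoch to a Python object.

    Hypothetical helper for illustration only, not a pyarrow function.
    """
    if use_pandas:
        # Explicit opt-in: pandas is imported only when the caller asks for it.
        import pandas as pd
        return pd.Timestamp(value_ns, unit="ns")

    # datetime.datetime only has microsecond resolution.
    micros, leftover_ns = divmod(value_ns, 1000)
    if leftover_ns:
        # Assumed behaviour: refuse to silently drop the nanoseconds.
        raise ValueError(
            "value has a non-zero nanosecond component; "
            "pass use_pandas=True to get a pandas Timestamp"
        )
    return _EPOCH + datetime.timedelta(microseconds=micros)


# A whole number of microseconds converts to a plain datetime.datetime.
print(ns_timestamp_as_py(1_500_000_000))

# A value with leftover nanoseconds raises by default.
try:
    ns_timestamp_as_py(1_500_000_001)
except ValueError as exc:
    print("error:", exc)
```

With this shape, users who only ever store whole microseconds never need pandas, and the error message itself points to the keyword for those who do need nanosecond precision.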

> [Python] Arrow-to-Python conversion is slow
> -------------------------------------------
>
>                 Key: ARROW-12976
>                 URL: https://issues.apache.org/jira/browse/ARROW-12976
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Antoine Pitrou
>            Assignee: Micah Kornfield
>            Priority: Major
>
> It seems that we are 20x slower than NumPy for converting the exact same data 
> to a Python list.
> With integers:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.int64)
> >>> %timeit arr.tolist()
> 8.24 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 218 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}
> With floats:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.float64)
> >>> %timeit arr.tolist()
> 10.2 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 199 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> {code}



