[ https://issues.apache.org/jira/browse/ARROW-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17432406#comment-17432406 ]
Joris Van den Bossche commented on ARROW-12976:
-----------------------------------------------
[~emkornfield] this is specifically about the scalar {{as_py()}} and array
{{to_pylist()}} behaviour, right? (And not {{Table.to_pandas()}}.)
Personally, I would be fine with a more explicit API (I assume the idea is to
add a keyword to those functions to explicitly ask for a pandas object?). But
I have some concerns:
1) Changing the default for ns resolution from pd.Timestamp to
datetime.datetime would be a backwards-incompatible change. How do we want to
handle that: just change it, or deprecate first? (Deprecating seems a bit
annoying, although if we add a keyword, it can be used directly to silence the
warning, and I suppose those functions are not used that much anyway.)
2) If the timestamp value has non-zero nanoseconds, would we then raise an
error by default (the one we raise now if pandas is not installed)? That
doesn't feel like a great user experience, but I suppose it is the inevitable
consequence of a more explicit API.
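To make concern 2 concrete, here is a minimal sketch of what "raise by default" could look like. It is not pyarrow code: {{ns_timestamp_to_datetime}} is a hypothetical helper using only the standard library, and it assumes the value is stored as nanoseconds since the Unix epoch with no time zone.

```python
import datetime as dt

# Hypothetical helper, NOT part of pyarrow: it only illustrates the
# "datetime.datetime by default, error on leftover nanoseconds" idea.
EPOCH = dt.datetime(1970, 1, 1)


def ns_timestamp_to_datetime(value_ns: int) -> dt.datetime:
    """Convert nanoseconds since the Unix epoch to a naive datetime.datetime.

    Raises ValueError when the value is not a whole number of microseconds,
    because datetime.datetime cannot represent finer precision.
    """
    micros, leftover_ns = divmod(value_ns, 1000)
    if leftover_ns:
        raise ValueError(
            f"{value_ns} ns is not a whole number of microseconds "
            f"({leftover_ns} ns would be lost)"
        )
    return EPOCH + dt.timedelta(microseconds=micros)


# A whole number of microseconds converts without loss.
print(ns_timestamp_to_datetime(1_500_000_000_000_001_000))

# Leftover nanoseconds are rejected rather than silently truncated.
try:
    ns_timestamp_to_datetime(1_500_000_000_000_000_001)
except ValueError as exc:
    print("error:", exc)
```

Under this sketch, a user with true nanosecond data hits the error on the first such value, and the proposed keyword would be the explicit opt-in for a pandas object.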
> [Python] Arrow-to-Python conversion is slow
> -------------------------------------------
>
> Key: ARROW-12976
> URL: https://issues.apache.org/jira/browse/ARROW-12976
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Antoine Pitrou
> Assignee: Micah Kornfield
> Priority: Major
>
> It seems that we are roughly 20x slower than NumPy when converting the exact
> same data to a Python list.
> With integers:
> {code:python}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> arr = np.arange(0, 1000, dtype=np.int64)
> >>> %timeit arr.tolist()
> 8.24 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 218 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}
> With floats:
> {code:python}
> >>> arr = np.arange(0, 1000, dtype=np.float64)
> >>> %timeit arr.tolist()
> 10.2 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 199 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
> {code}