[
https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026367#comment-17026367
]
Wes McKinney commented on ARROW-7723:
-------------------------------------
Alright, so here is what is going on.
{{PyArray_GETITEM}} returns different data types for datetime64 depending on
the unit.
* For resolutions coarser than nanoseconds, datetime.datetime is returned
* For nanoseconds, PyLong is returned
And this makes sense because datetime.datetime cannot faithfully represent
nanoseconds (this is why we have {{pandas.Timestamp}}).
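To make the unit dependence concrete, here is a minimal sketch at the Python level. It assumes that NumPy's {{ndarray.tolist}} goes through the same per-item conversion that {{PyArray_GETITEM}} performs in C; the sample timestamp and the helper name {{item_types}} are illustrative only and do not come from the Arrow code base.

```python
import datetime

import numpy as np


def item_types(value="2020-01-29T11:38:02"):
    """Map each timestamp unit to the Python type NumPy returns for one item.

    Hypothetical helper for illustration; only the four units Arrow
    timestamps support are checked.
    """
    result = {}
    for unit in ("s", "ms", "us", "ns"):
        arr = np.array([value], dtype=f"datetime64[{unit}]")
        # tolist() converts each element to a Python object, which is
        # assumed here to match what PyArray_GETITEM hands back in C.
        result[unit] = type(arr.tolist()[0])
    return result


types = item_types()
for unit, py_type in types.items():
    print(unit, py_type.__name__)
```

Second through microsecond units should report {{datetime}}, and only the nanosecond unit should report {{int}}.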
In ARROW-3789, while unifying the DataFrame and Series conversion paths, I
altered all TZ-aware timestamp data to go through a nanosecond promotion in C++
(since things will end up as datetimetz data type in pandas). But then this
triggers the PyLong path during the struct conversion for second through
microsecond resolution.
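A rough sketch of that effect, using NumPy's {{astype}} as a stand-in for the C++ nanosecond promotion (an assumption; the real promotion happens inside Arrow). The timestamp is the UTC equivalent of the value shown in the issue description below, and the variable names are illustrative.

```python
import datetime

import numpy as np

# Microsecond-resolution value, as stored before the promotion.
us = np.array(["2020-01-29T11:38:47.848944"], dtype="datetime64[us]")
before = us.tolist()[0]

# Stand-in for the nanosecond promotion: same instant, finer unit.
ns = us.astype("datetime64[ns]")
after = ns.tolist()[0]

print(type(before).__name__, before)  # a datetime.datetime object
print(type(after).__name__, after)    # a plain integer count of nanoseconds
```

The integer produced after the promotion is the same nanosecond count that appears in the struct output of the report.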
Note that in both 0.15.1 and master, the bad behavior that [~bryanc] cited is
present for nanosecond resolution.
I'm going to try to kludge things to preserve the 0.15.1 behavior at least, but
the inconsistency here seems broken to me, and we should probably do something
about it after 0.16.0.
> [Python] StructArray timestamp type with timezone to_pandas convert error
> --------------------------------------------------------------------------
>
> Key: ARROW-7723
> URL: https://issues.apache.org/jira/browse/ARROW-7723
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Bryan Cutler
> Assignee: Wes McKinney
> Priority: Blocker
> Fix For: 0.16.0
>
>
> When a {{StructArray}} has a child that is a timestamp with a timezone, the
> {{to_pandas}} conversion outputs an int64 instead of a timestamp
> {code:java}
> In [1]: import pyarrow as pa
> ...: import pandas as pd
> ...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': pd.Timestamp.now()}])
> ...:
>
> In [2]: arr.to_pandas()
>
> Out[2]:
> 0 {'end': 2020-01-29 11:38:02.792681, 'start': 2...
> dtype: object
> In [3]: ts = pd.Timestamp.now()
>
> In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York'))
>
> In [5]: arr2.to_pandas()
>
> Out[5]:
> 0 2020-01-29 06:38:47.848944-05:00
> dtype: datetime64[ns, America/New_York]
> In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop'])
>
> In [7]: arr.to_pandas()
>
> Out[7]:
> 0 {'start': 1580297927848944000, 'stop': 1580297...
> dtype: object
> {code}
> from https://github.com/apache/arrow/pull/6312