[jira] [Commented] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error

Wes McKinney (Jira) Wed, 29 Jan 2020 17:23:22 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026367#comment-17026367
 ]


Wes McKinney commented on ARROW-7723:
-------------------------------------

Alright, so here is what is going on.

{{PyArray_GETITEM}} returns different data types for datetime64 depending on 
the unit.

* For resolutions coarser than nanoseconds, datetime.datetime is returned
* For nanoseconds, PyLong is returned

And this makes sense because datetime.datetime cannot faithfully represent 
nanoseconds (this is why we have {{pandas.Timestamp}}).
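
The unit-dependent return type can be seen from Python as well. This is a minimal sketch, assuming that {{ndarray.item}} goes through the same per-element getter as {{PyArray_GETITEM}}; the helper name {{element_type}} is made up for illustration.

```python
import datetime

import numpy as np


def element_type(unit):
    """Return the Python type produced when one datetime64[unit] element is boxed."""
    arr = np.array(["2020-01-29T11:38:02"], dtype=f"datetime64[{unit}]")
    # ndarray.item boxes a single element into a Python object
    # (assumed here to mirror what PyArray_GETITEM does in C).
    return type(arr.item(0))


for unit in ("s", "ms", "us", "ns"):
    print(unit, element_type(unit).__name__)
```

Under that assumption, the second through microsecond units should box to datetime.datetime and the nanosecond unit to a plain Python integer.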

In ARROW-3789, while unifying the DataFrame and Series conversion paths, I 
altered all TZ-aware timestamp data to go through a nanosecond promotion in C++ 
(since things will end up as datetimetz data type in pandas). But then this 
triggers the PyLong path during the struct conversion for second through 
microsecond resolution.
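
To illustrate the effect of the promotion, here is a rough NumPy/pandas sketch. It is not the Arrow C++ code path; it only mimics the unit change with {{astype}} and assumes, as above, that {{ndarray.item}} behaves like {{PyArray_GETITEM}}. The input value is the UTC instant from the report.

```python
import numpy as np
import pandas as pd

# Microsecond-resolution value, as in the reported array (UTC wall time).
us = np.array(["2020-01-29T11:38:47.848944"], dtype="datetime64[us]")

# Stand-in for the nanosecond promotion applied to TZ-aware data.
promoted = us.astype("datetime64[ns]")

# Boxing the promoted element yields an integer count of nanoseconds
# since the epoch rather than a datetime object.
raw = promoted.item(0)
print(type(raw).__name__, raw)

# One possible way to turn the integer back into a TZ-aware timestamp.
restored = (
    pd.Timestamp(raw, unit="ns")
    .tz_localize("UTC")
    .tz_convert("America/New_York")
)
print(restored)
```

The integer produced here is the same kind of value that shows up inside the struct dictionaries in the report.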

Note that in both 0.15.1 and master, the bad behavior that [~bryanc] cited is 
present for nanosecond resolution. 

I'm going to try to kludge things to preserve the 0.15.1 behavior at least, but 
the inconsistency here seems broken to me, and we should probably do something 
about it after 0.16.0.

> [Python] StructArray timestamp type with timezone to_pandas convert error
> -------------------------------------------------------------------------
>
>                 Key: ARROW-7723
>                 URL: https://issues.apache.org/jira/browse/ARROW-7723
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Bryan Cutler
>            Assignee: Wes McKinney
>            Priority: Blocker
>             Fix For: 0.16.0
>
>
> When a {{StructArray}} has a child that is a timestamp with a timezone, the 
> {{to_pandas}} conversion outputs an int64 instead of a timestamp.
> {code:java}
> In [1]: import pyarrow as pa
>    ...: import pandas as pd
>    ...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': pd.Timestamp.now()}])
>
> In [2]: arr.to_pandas()
> Out[2]:
> 0    {'end': 2020-01-29 11:38:02.792681, 'start': 2...
> dtype: object
>
> In [3]: ts = pd.Timestamp.now()
>
> In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York'))
>
> In [5]: arr2.to_pandas()
> Out[5]:
> 0   2020-01-29 06:38:47.848944-05:00
> dtype: datetime64[ns, America/New_York]
>
> In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop'])
>
> In [7]: arr.to_pandas()
> Out[7]:
> 0    {'start': 1580297927848944000, 'stop': 1580297...
> dtype: object
> {code}
> from https://github.com/apache/arrow/pull/6312



--
This message was sent by Atlassian Jira
(v8.3.4#803005)