Re: [I] [Python] Difference in timezone-awareness of result when calling to_pandas between unnested and nested timestamp arrays [arrow]

via GitHub Mon, 15 Apr 2024 01:02:25 -0700


jorisvandenbossche commented on issue #41162:
URL: https://github.com/apache/arrow/issues/41162#issuecomment-2056039765


   This is somewhat "expected" (or at least something that has been 
implemented like this consciously AFAIK because of lack of good alternatives), 
although it is probably one or all of inconsistent, surprising, and undocumented.
   
   > It's my understanding that numpy's datetimes aren't timezone-aware 
([ref](https://numpy.org/devdocs/reference/arrays.datetime.html)) so it seems 
possible PyArrow is inheriting that behavior. The pandas docs [point to the 
arrays.DatetimeArray 
extensiontype](https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/arrays.html#datetime-data)
 which I don't think PyArrow is making use of.
   
   This indeed goes to the crux of the issue. 
   If we consider the non-nested case first for a moment, there are essentially 
three ways we can convert a tz-aware timestamp array to pandas/numpy: as numpy 
datetime64 dtype (losing any tz information), as pandas' tz-aware datetime64 
dtype, or as python objects:
   
   ```python
   >>> import pandas as pd
   >>> import pyarrow as pa
   >>> ts = pd.Timestamp('2024-01-01 12:00:00+0000', tz='Europe/Paris')
   >>> arr = pa.array([ts])
   # numpy datetime64 dtype (losing any tz information)
   >>> arr.to_numpy()
   array(['2024-01-01T12:00:00.000000'], dtype='datetime64[us]')
   # pandas' tz-aware datetime64 dtype
   >>> arr.to_pandas().array
   <DatetimeArray>
   ['2024-01-01 13:00:00+01:00']
   Length: 1, dtype: datetime64[us, Europe/Paris]
   # python objects
   >>> arr.to_pandas(timestamp_as_object=True).to_numpy()
   array([datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)],
         dtype=object)
   ```
   
   The above is for top-level (non-nested) fields, and in that case we default 
to using pandas' custom tz-aware extension type in `to_pandas()`. 
   
   However, for nested arrays the situation is a bit different, as you noted in 
the OP:
   
   ```python
   # struct
   >>> arr = pa.array([{"a": ts}])
   >>> arr.to_pandas().to_numpy()
   array([{'a': datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)}],
         dtype=object)
   
   # list
   >>> arr = pa.array([[ts]])
   >>> arr.to_pandas().to_numpy()
   array([array(['2024-01-01T12:00:00.000000'], dtype='datetime64[us]')],
         dtype=object)
   ```
   
   For structs, the data is converted to Python dictionaries. Since we convert 
to Python objects anyway, we essentially do the "as python objects" conversion 
for the child field as well (https://github.com/apache/arrow/pull/7604).
   
   For a list, you can see that this is using the numpy datetime64 dtype (and 
thus losing the tz information). The reason for this is a bit more technical 
(or historical): at the PyArrow C++ level, we create one numpy array for the 
flat values behind the ListArray, and then create an object-dtype numpy array 
of slices of that parent numpy array. At that point we only deal with numpy 
arrays, and not with pandas ExtensionArrays. 
   (as a similar example, a dictionary encoded array is converted to a 
pandas.Categorical extension array, but a dictionary child in a list is 
converted to the plain numpy type as well)
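
   As a rough illustration of that mechanism (a numpy-only sketch, not the 
actual C++ code; the `flat` and `offsets` names and values are made up for the 
example), the result is an object array whose elements are views into one 
shared datetime64 array, which has no place to store a timezone:

```python
import numpy as np

# Hypothetical flat child values of a list array [[t0, t1], [t2]],
# stored as naive (UTC) datetime64 values.
flat = np.array(
    ["2024-01-01T12:00:00", "2024-01-02T12:00:00", "2024-01-03T12:00:00"],
    dtype="datetime64[us]",
)
# List i spans flat[offsets[i]:offsets[i + 1]].
offsets = [0, 2, 3]

# Outer object-dtype array; each element is a slice (view) of `flat`.
result = np.empty(len(offsets) - 1, dtype=object)
for i in range(len(result)):
    result[i] = flat[offsets[i]:offsets[i + 1]]
```

   Each `result[i]` is a plain datetime64 numpy array sharing memory with 
`flat`, so a tz-aware pandas dtype cannot be attached to it.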
   
   If we would like to preserve the tz information, we would need to create the 
pandas datetimetz array at the C++ level. That should actually be possible, 
although given this would go through plain Python calls (pandas has no C API), 
it might give quite a slowdown compared to the current conversion (that is 
something to test, to get an idea of how significant it would be).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
