jorisvandenbossche commented on issue #41162: URL: https://github.com/apache/arrow/issues/41162#issuecomment-2056039765
This is somewhat "expected" (or at least something that was implemented this way consciously AFAIK, for lack of good alternatives), although probably one or all of inconsistent/surprising/undocumented.

> It's my understanding that numpy's datetimes aren't timezone-aware ([ref](https://numpy.org/devdocs/reference/arrays.datetime.html)) so it seems possible PyArrow is inheriting that behavior. The pandas docs [point to the arrays.DatetimeArray extensiontype](https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/arrays.html#datetime-data) which I don't think PyArrow is making use of.

This indeed goes to the crux of the issue. If we consider the non-nested case first for a moment, there are essentially three ways we can convert a tz-aware timestamp array to pandas/numpy: as numpy datetime64 dtype (losing any tz information), as pandas' tz-aware datetime64 dtype, or as python objects:

```python
>>> import pandas as pd
>>> import pyarrow as pa
>>> ts = pd.Timestamp('2024-01-01 12:00:00+0000', tz='Europe/Paris')
>>> arr = pa.array([ts])

# numpy datetime64 dtype (losing any tz information)
>>> arr.to_numpy()
array(['2024-01-01T12:00:00.000000'], dtype='datetime64[us]')

# pandas' tz-aware datetime64 dtype
>>> arr.to_pandas().array
<DatetimeArray>
['2024-01-01 13:00:00+01:00']
Length: 1, dtype: datetime64[us, Europe/Paris]

# python objects
>>> arr.to_pandas(timestamp_as_object=True).to_numpy()
array([datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)],
      dtype=object)
```

The above is for top-level (non-nested) fields, and in that case we default to using pandas' custom tz-aware extension type in `to_pandas()`.

However, for nested arrays the situation is a bit different, as you noted in the OP:

```python
# struct
>>> arr = pa.array([{"a": ts}])
>>> arr.to_pandas().to_numpy()
array([{'a': datetime.datetime(2024, 1, 1, 13, 0, tzinfo=<DstTzInfo 'Europe/Paris' CET+1:00:00 STD>)}],
      dtype=object)

# list
>>> arr = pa.array([[ts]])
>>> arr.to_pandas().to_numpy()
array([array(['2024-01-01T12:00:00.000000'], dtype='datetime64[us]')],
      dtype=object)
```

For structs, the data is being converted to python dictionaries, and so since we convert to python objects anyway, we essentially do the "as python object" conversion for the flat field (https://github.com/apache/arrow/pull/7604).

For a list, you can see that this is using the numpy datetime64 dtype (and thus losing the tz information). The reason for this is maybe a bit more technical (or historical), but the way this conversion happens is that at the PyArrow C++ level, we create one numpy array for the flat values behind the ListArray, and then create an object-dtype numpy array of slices of that parent numpy array. This currently happens at the C++ level, and at that point we only deal with numpy arrays, and not with pandas ExtensionArrays. (As a similar example, a dictionary-encoded array is converted to a pandas.Categorical extension array, but a dictionary child in a list is converted to the plain numpy type as well.)

If we would like to preserve this information, we would need to create the pandas datetimetz array at the C++ level. Now, that should actually be possible, although given this would go through plain python calls (pandas has no C API), that might give quite a slowdown compared to the current conversion (but that's something to test, to get an idea of how significant that would be).
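
To make the list case a bit more concrete, here is a rough pure-numpy sketch of the "one flat array plus an object array of slices" approach described above. This is only an illustration under stated assumptions, not the actual PyArrow C++ code: `lists_to_object_array`, `flat_values` and `offsets` are made-up names, and the sample values are invented.

```python
import numpy as np

# Flat child values of a hypothetical list<timestamp[us, tz=Europe/Paris]> array.
# Arrow stores tz-aware timestamps as UTC instants; the time zone is part of the
# Arrow type, so a plain datetime64 array has nowhere to keep it.
flat_values = np.array(
    ["2024-01-01T12:00:00", "2024-01-02T12:00:00", "2024-01-03T12:00:00"],
    dtype="datetime64[us]",
)

# List offsets: row i covers flat_values[offsets[i]:offsets[i + 1]].
offsets = [0, 1, 3]


def lists_to_object_array(values, offsets):
    """Build an object-dtype array whose elements are slices of `values`."""
    n_rows = len(offsets) - 1
    out = np.empty(n_rows, dtype=object)
    for i in range(n_rows):
        # Basic slicing returns a view, so no timestamp data is copied.
        out[i] = values[offsets[i]:offsets[i + 1]]
    return out


result = lists_to_object_array(flat_values, offsets)
print(result[1].dtype)  # each row is a plain datetime64[us] array, no tz
```

Every row here is a view on the same parent datetime64 array, which is why the zone cannot survive this path. A caller who needs it back could, as a workaround, localize each row as UTC and convert it to the zone taken from the Arrow type (for example with pandas' `tz_localize("UTC")` followed by `tz_convert(...)`), at the cost of extra python-level work per row.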
