jorisvandenbossche commented on issue #38647: URL: https://github.com/apache/arrow/issues/38647#issuecomment-1854235246
Hey @0x26res, some late comments on this: your observation is fully correct, and I agree that ideally it would be good to be consistent, but there is some historical context ;)

First, the `types_mapper` keyword is actually only (at least originally) meant for pandas ExtensionDtypes, i.e. to specify that an extension dtype should be used instead of the default (typically numpy) dtype that pandas uses for a certain type of data. So it is not a generic way to specify a target schema (firstly it is per type and not per column, and you also wouldn't be able to say to convert decimals to float, for example). And `datetime64[ns]` is not a pandas extension dtype, but a numpy dtype. (I suppose this is one of the few cases where this would occur, because in most other cases I can think of, we already convert to a numpy dtype by default, and when you want something different you typically want an extension dtype. In this case the default is the sub-optimal object dtype, and it makes sense that you want the numpy option.)

Secondly, there is the issue that the `types_mapper` keyword is handled differently in `Table.to_pandas` and `Array.to_pandas`, which leads to some inconsistency in this case. For `Table.to_pandas`, we have some preprocessing on the Python side to indicate to the C++ `ConvertTableToPandas` which columns of the table will be converted to extension dtypes; those columns are then left as-is on the C++ side, and the conversion is done in Python afterwards. This conversion happens at: https://github.com/apache/arrow/blob/d2209582a0ef81c93342183cab3c12d69e79c5be/python/pyarrow/pandas_compat.py#L723-L733 So this uses `pandas_dtype.__from_arrow__` (a method that only exists on pandas ExtensionDtypes, not on numpy dtypes).
The preprocessing that I mentioned happens here: https://github.com/apache/arrow/blob/d2209582a0ef81c93342183cab3c12d69e79c5be/python/pyarrow/pandas_compat.py#L792-L846 Essentially, that determines which columns of the table should use the above mechanism of `dtype.__from_arrow__` to convert themselves, based on the pandas metadata, the extension types in the arrow schema, and the `types_mapper` parameter. But the fact that we call `__from_arrow__` for the types specified in `types_mapper` explains the error you get:

```
In [5]: table.to_pandas(types_mapper={pa.date32(): "datetime64[ns]"}.get)
...
ValueError: This column does not support to be converted to a pandas ExtensionArray
```

(For this case the error message is not super clear, though, because the problem is not the column but the dtype specified in `types_mapper`. This is something that we could improve.)

On the other hand, the `Array.to_pandas` handling of `types_mapper` is much simpler (and was added later to match the functionality of the table conversion). Here, in case of an extension type or `types_mapper`, we get the pandas dtype and call `__from_arrow__` _if it has that attribute_: https://github.com/apache/arrow/blob/d2209582a0ef81c93342183cab3c12d69e79c5be/python/pyarrow/array.pxi#L1790-L1804 But then later in that `_array_like_to_pandas` helper function, after converting the pyarrow array to numpy, we pass that numpy array to the `pandas.Series` constructor with a `dtype` specified: https://github.com/apache/arrow/blob/d2209582a0ef81c93342183cab3c12d69e79c5be/python/pyarrow/array.pxi#L1824-L1834 And because the `dtype` is passed there (also if it comes from `types_mapper`), your example of using `datetime64[ns]` in the `types_mapper` works for the Array case.
But I think that is actually a left-over from https://github.com/apache/arrow/pull/36314 (which added the `__from_arrow__` handling in `Array.to_pandas`), and it could/should be removed if we want to keep the `types_mapper` functionality strictly for `pandas.ExtensionDtype` (which would also make the behaviour consistent for both cases here).
