jorisvandenbossche commented on issue #38647: URL: https://github.com/apache/arrow/issues/38647#issuecomment-1854235246
Hey @0x26res, some late comments on this: your observation is fully correct, and I agree that ideally it would be good to be consistent, but there is some historical context ;)

First, the `types_mapper` keyword is actually only (at least originally) meant for pandas ExtensionDtypes, i.e. to specify that an extension dtype should be used instead of the default (typically numpy) dtype that pandas uses for a certain type of data. So it is not a generic way to specify a target schema (firstly it is per type and not per column, and you also wouldn't be able to say to convert decimals to float, for example). And `datetime64[ns]` is not a pandas extension dtype, but a numpy dtype. (I suppose this is one of the few cases where this would occur, because in most other cases I can think of, we already convert to a numpy dtype by default, and when you want something different you typically want an extension dtype. In this case the default is the sub-optimal object dtype, and it makes sense that you want the numpy option.)

Secondly, there is the issue that the `types_mapper` keyword is handled differently in `Table.to_pandas` and `Array.to_pandas`, which leads to some inconsistency in this case. For `Table.to_pandas`, we have some preprocessing on the Python side to indicate to the C++ `ConvertTableToPandas` which columns of the table will be converted to extension dtypes; those columns are then left as-is on the C++ side, and the conversion is done in Python afterwards. This conversion happens at: https://github.com/apache/arrow/blob/d2209582a0ef81c93342183cab3c12d69e79c5be/python/pyarrow/pandas_compat.py#L723-L733 So this uses `pandas_dtype.__from_arrow__` (a method that only exists on pandas ExtensionDtypes, not on numpy dtypes).
The preprocessing that I mentioned happens here: https://github.com/apache/arrow/blob/d2209582a0ef81c93342183cab3c12d69e79c5be/python/pyarrow/pandas_compat.py#L792-L846 Essentially, that determines which columns of the table should use the above mechanism of `dtype.__from_arrow__` to convert themselves, based on the pandas metadata, the extension types in the arrow schema, and the `types_mapper` parameter. But the fact that we call `__from_arrow__` for the types specified in `types_mapper` explains the error you get:

```
In [5]: table.to_pandas(types_mapper={pa.date32(): "datetime64[ns]"}.get)
...
ValueError: This column does not support to be converted to a pandas ExtensionArray
```

(For this case the error message is not super clear, though, because the problem is not the column but the dtype specified in `types_mapper`. This is something that we could improve.)

On the other hand, the `Array.to_pandas` handling of `types_mapper` is much simpler (and was added later to match the functionality of the table conversion). Here, in case of an extension type or `types_mapper`, we get the pandas dtype and call `__from_arrow__` _if it has that attribute_: https://github.com/apache/arrow/blob/d2209582a0ef81c93342183cab3c12d69e79c5be/python/pyarrow/array.pxi#L1790-L1804 But then later in that `_array_like_to_pandas` helper function, after converting the pyarrow array to numpy, we pass that numpy array to the `pandas.Series` constructor with a `dtype` specified: https://github.com/apache/arrow/blob/d2209582a0ef81c93342183cab3c12d69e79c5be/python/pyarrow/array.pxi#L1824-L1834 And because the `dtype` is passed there (also if it comes from `types_mapper`), your example of using `datetime64[ns]` in the `types_mapper` works for the Array case.
But I think that is actually a left-over from https://github.com/apache/arrow/pull/36314 (which added the `__from_arrow__` handling in `Array.to_pandas`), and it could/should be removed if we want to keep the `types_mapper` functionality strictly for `pandas.ExtensionDtype` (which would also make the behaviour consistent for both cases here).
