[
https://issues.apache.org/jira/browse/ARROW-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson resolved ARROW-7569.
------------------------------------
Resolution: Fixed
Issue resolved by pull request 6189
[https://github.com/apache/arrow/pull/6189]
> [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas
> conversions
> ---------------------------------------------------------------------------------------
>
> Key: ARROW-7569
> URL: https://issues.apache.org/jira/browse/ARROW-7569
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.16.0
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> ARROW-2428 was about adding such a mapping, and described three use cases
> (see this
> [comment|https://issues.apache.org/jira/browse/ARROW-2428?focusedCommentId=16914231&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16914231]
> for details):
> * Basic roundtrip based on the pandas_metadata (in {{to_pandas}}, we check if
> the pandas_metadata specifies pandas extension dtypes, and if so, use this as
> the target dtype for that column)
> * Conversion for pyarrow extension types that can define their equivalent
> pandas extension dtype
> * A way to override default conversion (e.g. for the built-in types, or in
> the absence of pandas_metadata in the schema). This would require the user to be
> able to specify some mapping of pyarrow type or column name to the pandas
> extension dtype to use.
> The PR that closed ARROW-2428 (https://github.com/apache/arrow/pull/5512)
> only covered the first two cases, and not the third case.
> I think it is still interesting to also cover the third case in some way.
> An example use case is the new nullable dtypes introduced in pandas
> (e.g. the nullable integer dtype). Assume I want to read a parquet file into a
> pandas DataFrame using this nullable integer dtype. The pyarrow Table has no
> pandas_metadata indicating to use this dtype (unless it was created from a
> pandas DataFrame that was already using this dtype, but that will often not
> be the case), and the pyarrow.int64() type is also not an extension type that
> can define its equivalent pandas extension dtype.
> Currently, the only solution is to first read it into a pandas DataFrame (which
> will use floats for the integers if there are nulls), and then to convert
> those floats back to a nullable integer dtype.
> A possible API for this could look like:
> {code}
> table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})
> {code}
> to indicate that you want to convert all columns of the pyarrow table with
> int64 type to a pandas column using the nullable Int64 dtype.
>