Neal Richardson reassigned ARROW-7569:

    Assignee: Joris Van den Bossche  (was: Neal Richardson)

> [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas 
> conversions
> ---------------------------------------------------------------------------------------
>                 Key: ARROW-7569
>                 URL: https://issues.apache.org/jira/browse/ARROW-7569
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.16.0
>          Time Spent: 20m
>  Remaining Estimate: 0h
> ARROW-2428 was about adding such a mapping, and described three use cases 
> (see this 
> [comment|https://issues.apache.org/jira/browse/ARROW-2428?focusedCommentId=16914231&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16914231]
>  for details):
> * Basic roundtrip based on the pandas_metadata (in {{to_pandas}}, we check 
> whether the pandas_metadata specifies pandas extension dtypes and, if so, use 
> those as the target dtypes for the corresponding columns)
> * Conversion for pyarrow extension types that can define their equivalent 
> pandas extension dtype
> * A way to override the default conversion (e.g. for the built-in types, or 
> in the absence of pandas_metadata in the schema). This would require the user 
> to be able to specify some mapping of pyarrow type or column name to the 
> pandas extension dtype to use.
> The PR that closed ARROW-2428 (https://github.com/apache/arrow/pull/5512) 
> only covered the first two cases, and not the third case.
> I think it is still interesting to also cover the third case in some way.  
> An example use case is the new nullable dtypes introduced in pandas 
> (e.g. the nullable integer dtype).  Assume I want to read a parquet file into a 
> pandas DataFrame using this nullable integer dtype. The pyarrow Table has no 
> pandas_metadata indicating to use this dtype (unless it was created from a 
> pandas DataFrame that was already using this dtype, but that will often not 
> be the case), and the pyarrow.int64() type is also not an extension type that 
> can define its equivalent pandas extension dtype. 
> Currently, the only solution is to first read it into a pandas DataFrame 
> (which will use floats for the integers if there are nulls), and then 
> convert those floats back to a nullable integer dtype afterwards. 
> A possible API for this could look like:
> {code}
> table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})
> {code}
> to indicate that you want to convert all columns of the pyarrow table with 
> int64 type to a pandas column using the nullable Int64 dtype.
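
For illustration, the current workaround described in the issue (float columns cast back to a nullable integer dtype after conversion) can be sketched with pandas alone. This is only a sketch under stated assumptions: the DataFrame below is a hand-built stand-in for what a conversion of an int64 column with nulls would yield, and {{apply_dtype_overrides}} is a hypothetical helper name, not part of pyarrow or pandas.

```python
import numpy as np
import pandas as pd

# Stand-in for the converted result: an integer column that contained
# nulls arrives as float64, with NaN marking the missing values.
df = pd.DataFrame({"a": np.array([1.0, 2.0, np.nan]), "b": ["x", "y", "z"]})


def apply_dtype_overrides(frame, overrides):
    """Hypothetical helper: cast selected columns to the given pandas dtypes.

    `overrides` maps column name -> pandas dtype, e.g. {"a": pd.Int64Dtype()}.
    Columns not listed keep their current dtype.
    """
    return frame.astype(overrides)


# Cast the float column back to the nullable integer dtype.
result = apply_dtype_overrides(df, {"a": pd.Int64Dtype()})
```

The extra pass over the data after conversion is the cost the proposed mapping would avoid, since the target dtype would be chosen during {{to_pandas}} itself.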

This message was sent by Atlassian Jira
