[
https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914231#comment-16914231
]
Joris Van den Bossche edited comment on ARROW-2428 at 8/23/19 12:55 PM:
------------------------------------------------------------------------
I am working on the actual ability to create ExtensionBlocks in the conversion
to pandas (ARROW-6321, [https://github.com/apache/arrow/pull/5162]), but to
complete that work, we also need to solve the question this issue raises: how
can we know, or how can the user indicate, which columns to convert to which type?
Below are some (long) thoughts (you can also read and comment on them in [Google
Docs|https://docs.google.com/document/d/1pr9PuBfXTdlUoAgyh9zPIKDJZalDLI6GuxqblMynMM8/edit#heading=h.dl5issk8bkd6])
about what the API for converting back to pandas ExtensionArrays could look
like. Feedback and ideas on the API are very welcome!
*Conversion arrow -> pandas*
Different use cases and options:
*Case 1: basic roundtrip of pandas ExtensionArrays (without involvement of
arrow ExtensionTypes).* For example, pandas' nullable integer arrays and
fletcher's arrays map to native arrow arrays (they don't need an Arrow
ExtensionType). It would be nice if DataFrames holding such pandas
ExtensionArrays could roundtrip out of the box:
- When converting a DataFrame with such arrays to arrow, we save the pandas
dtype in the metadata (a string representation of it). So we could use this
information to know that certain columns need to be converted back to
ExtensionArrays.
The question is then: how does Arrow know which extension array to convert it
to? We could look for a constructor classmethod on the pandas dtype (like
{{PandasDtype.__constructor_from_arrow__}}), call it, and put the returned
pandas ExtensionArray in the block structure pyarrow creates. This would be
roughly the inverse of {{__arrow_array__}}. Some pseudo-code to illustrate:
{code:python}
# the dtype name stored in the pandas metadata
pd_dtype_name = 'Int64'  # or 'fletcher[string]'
pd_dtype = pd.api.types.pandas_dtype(pd_dtype_name)

if hasattr(pd_dtype, '__constructor_from_arrow__'):
    # indicate to ConvertTableToPandas to use an ExtensionBlock for this column
    ...
    arr = ...  # the pyarrow Array for this column
    ext_arr = pd_dtype.__constructor_from_arrow__(arr)
    block = _pd_int.make_block(ext_arr, placement=placement,
                               klass=_pd_int.ExtensionBlock)
{code}
This will only work when the pandas dtype is registered on the pandas side (so
that the string name is recognized by pandas to re-create the dtype object).
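To make Case 1 runnable, here is a minimal sketch of that dispatch logic. Note that {{__constructor_from_arrow__}} does not exist; both the hook name and the {{convert_column}} helper are hypothetical, and grafting the hook onto {{Int64Dtype}} below is only a stand-in for a dtype author implementing it properly (e.g. zero-copy from the arrow buffers):
{code:python}
import pandas as pd

def convert_column(values, pd_dtype_name):
    # Proposed dispatch: resolve the dtype name stored in the pandas
    # metadata and defer to its __constructor_from_arrow__ hook if present
    pd_dtype = pd.api.types.pandas_dtype(pd_dtype_name)
    ctor = getattr(pd_dtype, "__constructor_from_arrow__", None)
    if ctor is not None:
        return ctor(values)
    return pd.array(values)  # placeholder for the default conversion

# For illustration only: graft the hypothetical hook onto Int64Dtype,
# round-tripping through a Python list instead of the arrow buffers
pd.Int64Dtype.__constructor_from_arrow__ = classmethod(
    lambda cls, values: pd.array(list(values), dtype="Int64")
)

arr = convert_column([1, 2, None], "Int64")
print(arr.dtype)  # Int64
{code}
A real implementation would receive the pyarrow Array (not a list) and could construct the ExtensionArray without copies.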
*Case 2: conversion for Arrow ExtensionTypes.* If you have defined an Arrow
ExtensionType in Python, it would be nice to be able to register a default
conversion to pandas. The extension type itself could know how to convert
itself to pandas (the result being either a plain numpy array or a pandas
ExtensionArray).
- We can add a method to {{pyarrow.ExtensionType}} that converts an array
of its type to a pandas-compatible array. This method can then be overridden by
implementors of an ExtensionType.
- Alternatively, next to defining the {{pyarrow.ExtensionType}}, the user
could register a function to be used for that type (to extend the default arrow
type -> pandas type mapping).
The method on the {{pyarrow.ExtensionType}} would be similar to the
{{__constructor_from_arrow__}} method on the {{pandas.ExtensionDtype}}. So we
could also choose to let this live on the pandas dtype, and then the
{{pyarrow.ExtensionType}} only needs to be mapped to a pandas dtype (but this
means that the {{pyarrow.ExtensionType}} can only be converted to a pandas
extension type).
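The first option for Case 2 could look like the sketch below. The method name {{to_pandas_array}} is hypothetical (the idea is that pyarrow would call it from {{Table.to_pandas}} for columns of this type); the serialize/deserialize methods are the standard ExtensionType boilerplate:
{code:python}
import pyarrow as pa

class PeriodType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.int64(), "example.period")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

    def to_pandas_array(self, storage_array):
        # Hypothetical hook: implementors override this; the default
        # could fall back to the storage type's own conversion.
        # Here we simply return a numpy array for illustration.
        return storage_array.to_numpy(zero_copy_only=False)

ty = PeriodType()
storage = pa.array([1, 2, 3], type=pa.int64())
result = ty.to_pandas_array(storage)
print(result)  # [1 2 3]
{code}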
*Case 3: override the default conversions.* There is a default mapping of
pyarrow types to pandas types in the conversion. This mapping could be
"extended" for ExtensionTypes (see above). But sometimes it will also be useful
to override the default mapping. For example, fletcher wants to have a way to
say to pyarrow: convert all your arrays to fletcher ExtensionArrays instead of
the default numpy types.
- The user could register functions that extend or override the default arrow
type -> pandas type mapping. Or they could register a pandas dtype per arrow
type, and then (similar to above) that pandas dtype would know how to convert
itself.
- Alternatively, the {{Table.to_pandas}} method could gain a {{dtype}}
keyword where you can specify the target dtype per column.
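The first option for Case 3 could be sketched as a small user-side registry. The registry and {{convert_with_overrides}} are illustrations, not existing pyarrow API, and pandas' nullable string dtype stands in here for a fletcher-style arrow-backed array:
{code:python}
import pandas as pd
import pyarrow as pa

# user-maintained mapping of arrow type -> conversion function
_conversion_registry = {}

def register_conversion(arrow_type, func):
    _conversion_registry[arrow_type] = func

def convert_with_overrides(column):
    func = _conversion_registry.get(column.type)
    if func is not None:
        return func(column)
    return column.to_pandas()  # fall back to the default conversion

# override the default: keep string columns in a pandas ExtensionArray
# instead of converting them to a numpy object array
register_conversion(
    pa.string(),
    lambda col: pd.array(col.to_pylist(), dtype="string"),
)

col = pa.array(["a", None, "c"])
result = convert_with_overrides(col)
print(result.dtype)  # string
{code}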
> [Python] Add API to map Arrow types (including extension types) to pandas
> ExtensionArray instances for to_pandas conversions
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-2428
> URL: https://issues.apache.org/jira/browse/ARROW-2428
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Priority: Major
> Fix For: 1.0.0
>
>
> With the next release of Pandas, it will be possible to define custom column
> types that back a {{pandas.Series}}. Thus we will not be able to cover all
> possible column types in the {{to_pandas}} conversion by default as we won't
> be aware of all extension arrays.
> To enable users to create {{ExtensionArray}} instances from Arrow columns in
> the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}}
> call where they can overload the default conversion routines with the ones
> that produce their {{ExtensionArray}} instances.
> This should avoid additional copies in the case where we would nowadays first
> convert the Arrow column into a default Pandas column (probably of object
> type) and the user would afterwards convert it to a more efficient
> {{ExtensionArray}}. This hook here will be especially useful when you build
> {{ExtensionArrays}} where the storage is backed by Arrow.
> The meta-issue that tracks the implementation inside of Pandas is:
> https://github.com/pandas-dev/pandas/issues/19696
--
This message was sent by Atlassian Jira
(v8.3.2#803003)