[
https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914231#comment-16914231
]
Joris Van den Bossche edited comment on ARROW-2428 at 8/23/19 12:55 PM:
------------------------------------------------------------------------
I am working on the actual ability to create ExtensionBlocks in the conversion
to pandas (ARROW-6321, [https://github.com/apache/arrow/pull/5162]), but to
complete that work, we also need to solve the question this issue raises: how
can we know, or how can the user indicate, which columns to convert to which type?
Below are some (long) thoughts (you can also read and comment on them in [Google
Docs|https://docs.google.com/document/d/1pr9PuBfXTdlUoAgyh9zPIKDJZalDLI6GuxqblMynMM8/edit#heading=h.dl5issk8bkd6])
about what the API for converting back to pandas ExtensionArrays could look
like. Feedback and ideas on the API are very welcome!
*Conversion arrow -> pandas*
Different use cases and options:
*Case 1: basic roundtrip of pandas ExtensionArrays (without involvement of
arrow ExtensionTypes).* For example, pandas' nullable integer arrays and
fletcher's arrays map to native arrow arrays (they don't need an Arrow
ExtensionType). It would be nice if DataFrames holding such pandas
ExtensionArrays could roundtrip out of the box:
- When converting a DataFrame with such arrays to arrow, we save the pandas
dtype in the metadata (a string representation of it). So we could use this
information to know that certain columns need to be converted back to
ExtensionArrays.
The question is then: how does Arrow know which extension array to convert it
to? We could look for a constructor classmethod on the pandas dtype (like
{{PandasDtype.__constructor_from_arrow__}}), call it, and put the returned
pandas ExtensionArray in the block structure pyarrow creates. This would be
roughly the inverse of {{__arrow_array__}}. Some pseudo-code to illustrate:
{code:python}
# the dtype name stored in the pandas metadata
pd_dtype_name = 'Int64'  # or 'fletcher[string]'
pd_dtype = pd.api.types.pandas_dtype(pd_dtype_name)

if hasattr(pd_dtype, '__constructor_from_arrow__'):
    # indicate to ConvertTableToPandas to use an ExtensionBlock for this column
    ...
    arr = ...  # the pyarrow Array for this column
    ext_arr = pd_dtype.__constructor_from_arrow__(arr)
    block = _pd_int.make_block(ext_arr, placement=placement,
                               klass=_pd_int.ExtensionBlock)
{code}
This will only work when the pandas dtype is registered on the pandas side (so
that the string name is recognized by pandas to re-create the dtype object).
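To make Case 1 runnable, here is a minimal sketch of that dispatch logic. Note that {{__constructor_from_arrow__}} does not exist; both the hook name and the {{convert_column}} helper are hypothetical, and grafting the hook onto {{Int64Dtype}} below is only a stand-in for a dtype author implementing it properly (e.g. zero-copy from the arrow buffers):
{code:python}
import pandas as pd

def convert_column(values, pd_dtype_name):
    # Proposed dispatch: resolve the dtype name stored in the pandas
    # metadata and defer to its __constructor_from_arrow__ hook if present
    pd_dtype = pd.api.types.pandas_dtype(pd_dtype_name)
    ctor = getattr(pd_dtype, "__constructor_from_arrow__", None)
    if ctor is not None:
        return ctor(values)
    return pd.array(values)  # placeholder for the default conversion

# For illustration only: graft the hypothetical hook onto Int64Dtype,
# round-tripping through a Python list instead of the arrow buffers
pd.Int64Dtype.__constructor_from_arrow__ = classmethod(
    lambda cls, values: pd.array(list(values), dtype="Int64")
)

arr = convert_column([1, 2, None], "Int64")
print(arr.dtype)  # Int64
{code}
A real implementation would receive the pyarrow Array (not a list) and could construct the ExtensionArray without copies.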
*Case 2: conversion for Arrow ExtensionTypes.* If you have defined an Arrow
ExtensionType in Python, it would be nice to be able to register a default
conversion to pandas. The extension type itself could know how to convert
itself to pandas (the result being either a plain numpy array or a pandas
ExtensionArray).
- We can add a method to {{pyarrow.ExtensionType}} that converts an array
of its type to a pandas-compatible array. This method can then be overridden by
implementors of an ExtensionType.
- Alternatively, next to defining the {{pyarrow.ExtensionType}}, the user
could register a function to be used for that type (to extend the default arrow
type -> pandas type mapping).
The method on the {{pyarrow.ExtensionType}} would be similar to the
{{__constructor_from_arrow__}} method on the {{pandas.ExtensionDtype}}. So we
could also choose to let this live on the pandas dtype, and then the
{{pyarrow.ExtensionType}} only needs to be mapped to a pandas dtype (but this
means that the {{pyarrow.ExtensionType}} can only be converted to a pandas
extension type).
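The first option for Case 2 could look like the sketch below. The method name {{to_pandas_array}} is hypothetical (the idea is that pyarrow would call it from {{Table.to_pandas}} for columns of this type); the serialize/deserialize methods are the standard ExtensionType boilerplate:
{code:python}
import pyarrow as pa

class PeriodType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.int64(), "example.period")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

    def to_pandas_array(self, storage_array):
        # Hypothetical hook: implementors override this; the default
        # could fall back to the storage type's own conversion.
        # Here we simply return a numpy array for illustration.
        return storage_array.to_numpy(zero_copy_only=False)

ty = PeriodType()
storage = pa.array([1, 2, 3], type=pa.int64())
result = ty.to_pandas_array(storage)
print(result)  # [1 2 3]
{code}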
*Case 3: override the default conversions.* There is a default mapping of
pyarrow types to pandas types in the conversion. This mapping could be
"extended" for ExtensionTypes (see above). But sometimes it will also be useful
to override the default mapping. For example, fletcher wants to have a way to
say to pyarrow: convert all your arrays to fletcher ExtensionArrays instead of
the default numpy types.
- The user could register functions that extend or override the default arrow
type -> pandas type mapping. Or they could register a pandas dtype per arrow
type, and then (similar to above) that pandas dtype would know how to convert
itself.
- Alternatively, the {{Table.to_pandas}} method could gain a {{dtype}}
keyword where you can specify the target dtype per column.
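The first option for Case 3 could be sketched as a small user-side registry. The registry and {{convert_with_overrides}} are illustrations, not existing pyarrow API, and pandas' nullable string dtype stands in here for a fletcher-style arrow-backed array:
{code:python}
import pandas as pd
import pyarrow as pa

# user-maintained mapping of arrow type -> conversion function
_conversion_registry = {}

def register_conversion(arrow_type, func):
    _conversion_registry[arrow_type] = func

def convert_with_overrides(column):
    func = _conversion_registry.get(column.type)
    if func is not None:
        return func(column)
    return column.to_pandas()  # fall back to the default conversion

# override the default: keep string columns in a pandas ExtensionArray
# instead of converting them to a numpy object array
register_conversion(
    pa.string(),
    lambda col: pd.array(col.to_pylist(), dtype="string"),
)

col = pa.array(["a", None, "c"])
result = convert_with_overrides(col)
print(result.dtype)  # string
{code}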
> [Python] Add API to map Arrow types (including extension types) to pandas
> ExtensionArray instances for to_pandas conversions
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-2428
> URL: https://issues.apache.org/jira/browse/ARROW-2428
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Priority: Major
> Fix For: 1.0.0
>
>
> With the next release of Pandas, it will be possible to define custom column
> types that back a {{pandas.Series}}. Thus we will not be able to cover all
> possible column types in the {{to_pandas}} conversion by default as we won't
> be aware of all extension arrays.
> To enable users to create {{ExtensionArray}} instances from Arrow columns in
> the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}}
> call where they can overload the default conversion routines with the ones
> that produce their {{ExtensionArray}} instances.
> This should avoid additional copies in the case where we would nowadays first
> convert the Arrow column into a default Pandas column (probably of object
> type) and the user would afterwards convert it to a more efficient
> {{ExtensionArray}}. This hook here will be especially useful when you build
> {{ExtensionArrays}} where the storage is backed by Arrow.
> The meta-issue that tracks the implementation inside of Pandas is:
> https://github.com/pandas-dev/pandas/issues/19696
--
This message was sent by Atlassian Jira
(v8.3.2#803003)