[
https://issues.apache.org/jira/browse/ARROW-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612555#comment-17612555
]
Joris Van den Bossche commented on ARROW-17925:
-----------------------------------------------
To give a concrete copy-pastable example (using the one from the docs:
https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion):
{code:python}
from collections import namedtuple

import pyarrow as pa

Point3D = namedtuple("Point3D", ["x", "y", "z"])

class Point3DScalar(pa.ExtensionScalar):
    def as_py(self) -> Point3D:
        return Point3D(*self.value.as_py())

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar
{code}
{code}
>>> storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
>>> arr = pa.ExtensionArray.from_storage(Point3DType(), storage)
>>> arr.to_pandas().values
array([array([1., 2., 3.], dtype=float32),
array([4., 5., 6.], dtype=float32)], dtype=object)
>>> arr.to_pylist()
[Point3D(x=1.0, y=2.0, z=3.0), Point3D(x=4.0, y=5.0, z=6.0)]
{code}
So here, {{to_pylist}} gives the nice scalars, while in {{to_pandas()}}, we
have the raw numpy arrays from converting the storage list array.
We _could_ do this automatically in {{to_pandas}} as well if we detect that the
ExtensionType raises NotImplementedError for {{to_pandas_dtype}} and returns a
subclass from {{\_\_arrow_ext_scalar_class\_\_}}.
On the other hand, you can also do this yourself by overriding {{to_pandas()}}?
And what about {{to_numpy()}}?
> [Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas?
> -----------------------------------------------------------------------------
>
> Key: ARROW-17925
> URL: https://issues.apache.org/jira/browse/ARROW-17925
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Joris Van den Bossche
> Priority: Major
>
> This was raised in ARROW-17813 by [~changhiskhan]:
> {quote}*ExtensionArray => pandas*
> Just for discussion, I was curious whether you had any thoughts around using
> the extension scalar as a fallback mechanism. It's a lot simpler to define an
> ExtensionScalar with `as_py` than a pandas extension dtype. So if an
> ExtensionArray doesn't have an equivalent pandas dtype, would it make sense
> to convert it to just an object series whose elements are the result of
> `as_py`? {quote}
> and I also mentioned this in ARROW-17535:
> {quote}That actually brings up a question: if an ExtensionType defines an
> ExtensionScalar (but not an associated pandas dtype, or custom to_numpy
> conversion), should we use this scalar's {{as_py()}} for the
> to_numpy/to_pandas conversion as well for plain extension arrays? (not the
> nested case)
> Because currently, if you have an ExtensionArray like that (for example using
> the example from the docs:
> https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion),
> we still use the storage type conversion for to_numpy/to_pandas, and only
> use the scalar's conversion in {{to_pylist}}.{quote}