[jira] [Commented] (ARROW-17925) [Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas?

Joris Van den Bossche (Jira) Tue, 04 Oct 2022 03:22:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612555#comment-17612555
 ]


Joris Van den Bossche commented on ARROW-17925:
-----------------------------------------------

To give a concrete copy-pastable example (using the one from the docs: 
https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion):

{code:python}
from collections import namedtuple
import pyarrow as pa

Point3D = namedtuple("Point3D", ["x", "y", "z"])

class Point3DScalar(pa.ExtensionScalar):
    def as_py(self) -> Point3D:
        return Point3D(*self.value.as_py())

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar
{code}

{code}
storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
arr = pa.ExtensionArray.from_storage(Point3DType(), storage)

>>> arr.to_pandas().values
array([array([1., 2., 3.], dtype=float32),
       array([4., 5., 6.], dtype=float32)], dtype=object)

>>> arr.to_pylist()
[Point3D(x=1.0, y=2.0, z=3.0), Point3D(x=4.0, y=5.0, z=6.0)]
{code}

So here, {{to_pylist()}} returns the custom {{Point3D}} scalars, while 
{{to_pandas()}} returns the raw numpy arrays produced by converting the 
storage list array. 

We _could_ do this automatically in {{to_pandas}} as well if we detect that the 
ExtensionType raises NotImplementedError for {{to_pandas_dtype}} and returns a 
custom ExtensionScalar subclass from {{__arrow_ext_scalar_class__}}. 

On the other hand, you can also do this yourself by overriding {{to_pandas()}}? 

And what about {{to_numpy()}}?

> [Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas?
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-17925
>                 URL: https://issues.apache.org/jira/browse/ARROW-17925
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> This was raised in ARROW-17813 by [~changhiskhan]:
> {quote}*ExtensionArray => pandas*
> Just for discussion, I was curious whether you had any thoughts around using 
> the extension scalar as a fallback mechanism. It's a lot simpler to define an 
> ExtensionScalar with `as_py` than a pandas extension dtype. So if an 
> ExtensionArray doesn't have an equivalent pandas dtype, would it make sense 
> to convert it to just an object series whose elements are the result of 
> `as_py`? {quote}
> and I also mentioned this in ARROW-17535:
> {quote}That actually brings up a question: if an ExtensionType defines an 
> ExtensionScalar (but not an associated pandas dtype, or custom to_numpy 
> conversion), should we use this scalar's {{as_py()}} for the 
> to_numpy/to_pandas conversion as well for plain extension arrays? (not the 
> nested case) 
> Because currently, if you have an ExtensionArray like that (for example using 
> the example from the docs: 
> https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion),
>  we still use the storage type conversion for to_numpy/to_pandas, and only 
> use the scalar's conversion in {{to_pylist}}.{quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
