[ 
https://issues.apache.org/jira/browse/ARROW-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17609608#comment-17609608
 ] 

Chang She commented on ARROW-17813:
-----------------------------------

[~jorisvandenbossche] thank you for the details above!

*ExtensionArray => pandas*

Just for discussion, I was curious whether you had any thoughts around using 
the extension scalar as a fallback mechanism. It's a lot simpler to define an 
ExtensionScalar with `as_py` than a pandas extension dtype. So if an 
ExtensionArray doesn't have an equivalent pandas dtype, would it make sense to 
convert it to just an object series whose elements are the result of `as_py`? I 
added it as a comment to ARROW-17353 for further discussion as well if it makes 
sense.

*pandas/numpy => Arrow*

{quote}One way this will be a bit easier is to cast to the final type, 
something like: list_of_storage.cast(pa.list_(LabelType())).{quote}

Yeah, that would certainly make it a lot more convenient! I don't see any tests 
relating to nested types in https://github.com/apache/arrow/pull/14106 but 
hopefully it's not much additional effort on top of what's already there?
 
{quote} this could be the equivalent of 
`pa.ExtensionArray.from_storage(LabelType(), pa.array(["dog", "cat", 
"horse"]))` ?
 >>> pa.array(["dog", "cat", "horse"], type=LabelType())
 ArrowNotImplementedError: extension
 I opened ARROW-17834 for this. 
{quote}

Agreed. Thanks for opening the JIRA. One additional tricky case here is when 
the storage array itself needs additional arguments. E.g., in CV (computer 
vision), most canonical datasets have a predetermined dictionary, so for the 
above example you'd often want to read in a CSV data dictionary and pass in the 
class names in the right order to construct the storage DictionaryArray 
(cross-posted on ARROW-17834).


{quote}If the above works, I think it should also work to specify a schema with 
the extension type in the Table.from_pandas conversion.
(we could still make it easier to allow to specify the type for one specific 
column, instead of having to specify the full schema){quote}

Yeah, that would be amazing. I'd love to toss away my custom type conversion 
code, which is hard to maintain (not to mention slow) :)

> [Python] Nested ExtensionArray conversion to/from pandas/numpy
> --------------------------------------------------------------
>
>                 Key: ARROW-17813
>                 URL: https://issues.apache.org/jira/browse/ARROW-17813
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Chang She
>            Assignee: Miles Granger
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> user@ thread: 
> [https://lists.apache.org/thread/dhnxq0g4kgdysjowftfv3z5ngj780xpb]
> repro gist: 
> [https://gist.github.com/changhiskhan/4163f8cec675a2418a69ec9168d5fdd9]
> *Arrow => numpy/pandas*
> For a non-nested array, pa.ExtensionArray.to_numpy automatically "lowers" to 
> the storage type (as expected). However this is not done for nested arrays:
> {code:python}
> import pyarrow as pa
> class LabelType(pa.ExtensionType):
>     def __init__(self):
>         super(LabelType, self).__init__(pa.string(), "label")
>     def __arrow_ext_serialize__(self):
>         return b""
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         return LabelType()
>     
> storage = pa.array(["dog", "cat", "horse"])
> ext_arr = pa.ExtensionArray.from_storage(LabelType(), storage)
> offsets = pa.array([0, 1])
> list_arr = pa.ListArray.from_arrays(offsets, ext_arr)
> list_arr.to_numpy()
> {code}
> {code:java}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> Cell In [15], line 1
> ----> 1 list_arr.to_numpy()
> File 
> /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:1445, 
> in pyarrow.lib.Array.to_numpy()
> File 
> /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in 
> pyarrow.lib.check_status()
> ArrowNotImplementedError: Not implemented type for Arrow list to pandas: 
> extension<label<LabelType>>
> {code}
> As mentioned on the user thread linked from the top, a fairly generic 
> solution would just have the conversion default to the storage array's 
> to_numpy.
>  
> *pandas/numpy => Arrow*
> Similarly, conversion to Arrow is also difficult for nested extension 
> types. 
> Say I have a pandas DataFrame with a list-of-string column that I want to 
> convert to a list-of-label Array. Currently I have to:
> 1. Convert the list-of-string (storage) numpy array to pa.list_(pa.string())
> 2. Convert the string values array to an ExtensionArray, then reconstitute a 
> list<extension> array using the ExtensionArray combined with the offsets from 
> the result of step 1
> {code:python}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'labels': [["dog", "horse", "cat"], ["person", "person", 
> "car", "car"]]})
> list_of_storage = pa.array(df.labels)
> ext_values = pa.ExtensionArray.from_storage(LabelType(), 
> list_of_storage.values)
> list_of_ext = pa.ListArray.from_arrays(offsets=list_of_storage.offsets, 
> values=ext_values)
> {code}
> For non-nested columns, one can achieve easier conversion by defining a 
> pandas extension dtype, but I don't think that works for a nested column. You 
> would instead have to fall back to something like 
> `pa.ExtensionArray.from_storage` (or `from_pandas`?) to do the trick. Even 
> that doesn't necessarily work for something like a dictionary column because 
> you'd have to pass in the dictionary somehow. Off the cuff, one could provide 
> a custom lambda to `pa.Table.from_pandas` that is applied to specified 
> column names / data types?
> Thanks in advance for the consideration!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
