jorisvandenbossche commented on issue #33727:
URL: https://github.com/apache/arrow/issues/33727#issuecomment-1387323624
Inside `pa.array(..)`, we convert a `pandas.Categorical` by converting its
indices and categories to arrow arrays, and then calling
`pa.DictionaryArray.from_arrays`. The first step works:
```
In [36]: indices = pa.array(df['x'].cat.codes)

In [37]: df["x"].cat.categories.values
Out[37]:
<ArrowStringArray>
['bar', 'foo']
Length: 2, dtype: string

In [39]: dictionary = pa.array(df["x"].cat.categories.values)

In [40]: dictionary
Out[40]:
<pyarrow.lib.ChunkedArray object at 0x7f9e87f0e7a0>
[
  [
    "bar",
    "foo"
  ]
]
```
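(For reference, a setup along these lines should reproduce the ChunkedArray result above; the exact DataFrame is an assumption on my part:)
```
import pandas as pd

# Assumed setup: a categorical column whose categories are backed by the
# pyarrow-based string dtype, so df["x"].cat.categories.values is an
# ArrowStringArray as shown above.
df = pd.DataFrame(
    {"x": pd.Series(["foo", "bar", "foo"], dtype="string[pyarrow]").astype("category")}
)
```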
But converting the categories results in a ChunkedArray, not a plain Array,
and it is then `DictionaryArray.from_arrays` that fails: it expects an Array,
and if the passed dictionary is not already an Array, it tries to convert it to one:
```
In [43]: pa.DictionaryArray.from_arrays(indices, dictionary)
...
ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type
pyarrow.lib.StringScalar: did not recognize Python value type when inferring an
Arrow data type

In [44]: pa.array(dictionary)
...
ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type
pyarrow.lib.StringScalar: did not recognize Python value type when inferring an
Arrow data type
```
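A possible workaround at the call site (not a fix for the underlying issue) is to concatenate the chunks into a plain Array first; something like the following should work:
```
import pyarrow as pa

# Collapse the (single-chunk) ChunkedArray into a plain Array before
# building the dictionary-encoded array.
dictionary_arr = pa.concat_arrays(dictionary.chunks)
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary_arr)
```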
We should probably ensure that `pa.array(..)` returns an Array instead of a
ChunkedArray when there is only one chunk (that logic could also live inside
pandas' `StringArray.__arrow_array__`).
I am not sure whether our APIs, such as `DictionaryArray.from_arrays`, should
also accept a ChunkedArray (and automatically concatenate the chunks?).
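As a rough Python-level sketch of that idea (the actual conversion happens in Cython/C++, and the helper name here is made up), the unwrapping could look like:
```
import pyarrow as pa

def _ensure_array(values):
    # Hypothetical helper: unwrap a single-chunk ChunkedArray into its only
    # chunk, otherwise concatenate the chunks into one Array.
    if isinstance(values, pa.ChunkedArray):
        if values.num_chunks == 1:
            return values.chunk(0)
        return pa.concat_arrays(values.chunks)
    return values
```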