jorisvandenbossche commented on issue #33727:
URL: https://github.com/apache/arrow/issues/33727#issuecomment-1387323624
Inside `pa.array(..)`, we convert a `pandas.Categorical` by converting its
indices and categories to arrow arrays, and then calling
`pa.DictionaryArray.from_arrays`. The first step works:
```
In [36]: indices = pa.array(df['x'].cat.codes)

In [37]: df["x"].cat.categories.values
Out[37]:
<ArrowStringArray>
['bar', 'foo']
Length: 2, dtype: string

In [39]: dictionary = pa.array(df["x"].cat.categories.values)

In [40]: dictionary
Out[40]:
<pyarrow.lib.ChunkedArray object at 0x7f9e87f0e7a0>
[
  [
    "bar",
    "foo"
  ]
]
```
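(For reference, a setup along these lines should reproduce the ChunkedArray result above; the exact DataFrame is an assumption on my part:)
```
import pandas as pd

# Assumed setup: a categorical column whose categories are backed by the
# pyarrow-based string dtype, so df["x"].cat.categories.values is an
# ArrowStringArray as shown above.
df = pd.DataFrame(
    {"x": pd.Series(["foo", "bar", "foo"], dtype="string[pyarrow]").astype("category")}
)
```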
But converting the categories results in a ChunkedArray, not a plain Array,
and it is then `DictionaryArray.from_arrays` that fails: it expects an Array,
and if the passed dictionary is not already an Array, it tries to convert it to one:
```
In [43]: pa.DictionaryArray.from_arrays(indices, dictionary)
...
ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type
pyarrow.lib.StringScalar: did not recognize Python value type when inferring an
Arrow data type

In [44]: pa.array(dictionary)
...
ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type
pyarrow.lib.StringScalar: did not recognize Python value type when inferring an
Arrow data type
```
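A possible workaround at the call site (not a fix for the underlying issue) is to concatenate the chunks into a plain Array first; something like the following should work:
```
import pyarrow as pa

# Collapse the (single-chunk) ChunkedArray into a plain Array before
# building the dictionary-encoded array.
dictionary_arr = pa.concat_arrays(dictionary.chunks)
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary_arr)
```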
We should probably ensure that `pa.array(..)` returns an Array instead of a
ChunkedArray when there is only one chunk (that logic could also live inside
pandas' `StringArray.__arrow_array__`).
I am not sure whether our APIs, such as `DictionaryArray.from_arrays`, should
also accept a ChunkedArray (and automatically concatenate the chunks?).
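As a rough Python-level sketch of that idea (the actual conversion happens in Cython/C++, and the helper name here is made up), the unwrapping could look like:
```
import pyarrow as pa

def _ensure_array(values):
    # Hypothetical helper: unwrap a single-chunk ChunkedArray into its only
    # chunk, otherwise concatenate the chunks into one Array.
    if isinstance(values, pa.ChunkedArray):
        if values.num_chunks == 1:
            return values.chunk(0)
        return pa.concat_arrays(values.chunks)
    return values
```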