AlenkaF commented on issue #39437:
URL: https://github.com/apache/arrow/issues/39437#issuecomment-1880694059

   The failure in `test_categories_with_string_pyarrow_dtype` is caused by a 
difference in the PyArrow array type produced when converting categorical data 
that was created from pandas `string[pyarrow]`. With pandas version `2.1.4` I get:
   
   ```python
   >>> import pandas as pd
   >>> import pyarrow as pa
   >>> df1 = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
   >>> df1 = df1.astype("category")
   
   >>> df2 = pd.DataFrame({"x": ["foo", "bar", "foo"]})
   >>> df2 = df2.astype("category")
   
   >>> pa.array(df1["x"]).type
   DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
   >>> pa.array(df2["x"]).type
   DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
   ```
   
   and with the pandas dev version I get:
   ```python
   >>> pa.array(df1["x"]).type
   DictionaryType(dictionary<values=large_string, indices=int8, ordered=0>)
   >>> pa.array(df2["x"]).type
   DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
   ```
   
   meaning that with pandas dev the Arrow string dtype is converted to 
`large_string` instead of `string`, which is what raises the error in the test. 
The PR that caused the change on the pandas side: 
https://github.com/pandas-dev/pandas/pull/56220
   
   Will update the test to reflect the change.

