AlenkaF commented on issue #33727: URL: https://github.com/apache/arrow/issues/33727#issuecomment-1386699394
Thank you for reporting @crusaderky! It seems `pa.array()` can't handle categorical pandas columns if the dictionary is of `string` type. The error is triggered in `pandas_compat.py`

https://github.com/apache/arrow/blob/f769f6b32373fcf5fc2a7a51152b375127ca4af7/python/pyarrow/pandas_compat.py#L591-L598

due to `pa.array()` raising `ArrowInvalid`:

```python
import pandas as pd
import pyarrow as pa

# Works with string series/column
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
pa.array(df["x"])
# <pyarrow.lib.ChunkedArray object at 0x12dcbbef0>
# [
#   [
#     "foo",
#     "bar",
#     "foo"
#   ]
# ]

# Works with categorical with dictionary as object type
df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
df = df.astype("category")
pa.array(df["x"])
# <pyarrow.lib.DictionaryArray object at 0x12dbc5ac0>
# -- dictionary:
#   [
#     "bar",
#     "foo"
#   ]
# -- indices:
#   [
#     1,
#     0,
#     1
#   ]

# Errors if dictionary in categorical column is string
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
df = df.astype("category")
pa.array(df["x"])
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "pyarrow/array.pxi", line 310, in pyarrow.lib.array
#     return DictionaryArray.from_arrays(
#   File "pyarrow/array.pxi", line 2608, in pyarrow.lib.DictionaryArray.from_arrays
#     _dictionary = array(dictionary, memory_pool=memory_pool)
#   File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
#     result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
#   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
#     chunked = GetResultValue(
#   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
#     return check_status(status)
#   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
#     raise ArrowInvalid(message)
# pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type
```

Debugging from
`convert_column` in `dataframe_to_arrays` (_pandas_compat.py_):

```python
import pandas as pd
from pyarrow.pandas_compat import dataframe_to_arrays

# Categorical with object-dtype categories
df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
df = df.astype("category")
dataframe_to_arrays(df, schema=None, preserve_index=None)
# > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
# -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
# (Pdb) col
# 0    foo
# 1    bar
# 2    foo
# Name: x, dtype: category
# Categories (2, object): ['bar', 'foo']

# Categorical with string-dtype categories
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
df = df.astype("category")
dataframe_to_arrays(df, schema=None, preserve_index=None)
# > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
# -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
# (Pdb) col
# 0    foo
# 1    bar
# 2    foo
# Name: x, dtype: category
# Categories (2, string): [bar, foo]
```