to_parquet fails

GitBox Wed, 18 Jan 2023 00:56:23 -0800


AlenkaF commented on issue #33727:
URL: https://github.com/apache/arrow/issues/33727#issuecomment-1386699394


   Thank you for reporting @crusaderky!
   It seems `pa.array()` can't handle categorical pandas columns when the
   categories (the dictionary) have the `string` dtype.
   
   The error is triggered in `pandas_compat.py`
   
https://github.com/apache/arrow/blob/f769f6b32373fcf5fc2a7a51152b375127ca4af7/python/pyarrow/pandas_compat.py#L591-L598
   
   due to `pa.array()` raising `ArrowInvalid`:
   
   ```python
   import pandas as pd
   import pyarrow as pa

   # Works with string series/column
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
   pa.array(df["x"])
   # <pyarrow.lib.ChunkedArray object at 0x12dcbbef0>
   # [
   #   [
   #     "foo",
   #     "bar",
   #     "foo"
   #   ]
   # ]
   
   # Works with categorical with dictionary as object type
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
   df = df.astype("category")
   pa.array(df["x"])
   # <pyarrow.lib.DictionaryArray object at 0x12dbc5ac0>
   
   # -- dictionary:
   #   [
   #     "bar",
   #     "foo"
   #   ]
   # -- indices:
   #   [
   #     1,
   #     0,
   #     1
   #   ]
   
   # Errors if dictionary in categorical column is string
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
   df = df.astype("category")
   pa.array(df["x"])
   # Traceback (most recent call last):
   #   File "<stdin>", line 1, in <module>
   #   File "pyarrow/array.pxi", line 310, in pyarrow.lib.array
   #     return DictionaryArray.from_arrays(
   #   File "pyarrow/array.pxi", line 2608, in pyarrow.lib.DictionaryArray.from_arrays
   #     _dictionary = array(dictionary, memory_pool=memory_pool)
   #   File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
   #     result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
   #   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
   #     chunked = GetResultValue(
   #   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
   #     return check_status(status)
   #   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
   #     raise ArrowInvalid(message)
   # pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type
   ```
   
   Debugging from `convert_column` in `dataframe_to_arrays` (_pandas_compat.py_) shows that the two columns differ only in the dtype of their categories (`object` vs. `string`):
   ```python
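   import pandas as pd
   from pyarrow.pandas_compat import dataframe_to_arrays

   # Categorical with object categories
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]})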
   df = df.astype("category")
   dataframe_to_arrays(df, schema=None, preserve_index=None)
   # > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
   # -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
   # (Pdb) col
   # 0    foo
   # 1    bar
   # 2    foo
   # Name: x, dtype: category
   # Categories (2, object): ['bar', 'foo']
   
   df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
   df = df.astype("category")
   dataframe_to_arrays(df, schema=None, preserve_index=None)
   # > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
   # -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
   # (Pdb) col
   # 0    foo
   # 1    bar
   # 2    foo
   # Name: x, dtype: category
   # Categories (2, string): [bar, foo]
   ```
   


[GitHub] [arrow] AlenkaF commented on issue #33727: pandas string[pyarrow] -> category -> to_parquet fails
