[
https://issues.apache.org/jira/browse/ARROW-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403079#comment-17403079
]
Joris Van den Bossche commented on ARROW-8017:
----------------------------------------------
A very late response, but the issue here is that pyarrow doesn't support
converting custom Python objects (because we can't store them in one of the
pyarrow data types). Thus in practice, we only support object dtype columns
that contain basic numbers or strings.
Small example with a custom class:
{code:python}
class MyClass:
    pass

# putting a custom class in a pandas Series works, this is stored in an
# "object" dtype column
In [16]: s = pd.Series([MyClass()])
In [17]: s
Out[17]:
0    <__main__.MyClass object at 0x7f887ad3f850>
dtype: object
# but converting such object dtype data to pyarrow isn't supported
In [18]: pa.array(s)
...
ArrowInvalid: Could not convert <__main__.MyClass object at 0x7f887ad3f850>
with type MyClass: did not recognize Python value type when inferring an Arrow
data type
{code}
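For path-like objects such as the pathlib Path from the original report, one
possible workaround is to convert the values to plain strings before handing
them to pyarrow. The snippet below is only a minimal sketch of that idea, not
something pyarrow provides: the helper name stringify_paths is hypothetical,
and pyarrow is imported optionally so the conversion step can be run on its
own.

```python
import os
from pathlib import Path


def stringify_paths(values):
    """Hypothetical helper: replace os.PathLike objects by their string form.

    Other values (strings, numbers, None) are passed through unchanged.
    """
    return [os.fspath(v) if isinstance(v, os.PathLike) else v for v in values]


paths = [Path("foo", "spam.wav"), None]
converted = stringify_paths(paths)

# pyarrow is optional here so the sketch also runs without it installed
try:
    import pyarrow as pa
except ImportError:
    pa = None

if pa is not None:
    # plain strings (and None) can be inferred as an Arrow string array
    arr = pa.array(converted)
    print(arr.type)
```

For a DataFrame column without missing values, the same idea would be
something like df["filepath"] = df["filepath"].map(os.fspath) before calling
to_parquet. Note that the values then come back as strings when reading, so
they have to be wrapped in Path again by hand.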
> [Python] Pyarrow no support for pathlib Path with table =
> pa.Table.from_pandas() or pd.to_parquet()
> ---------------------------------------------------------------------------------------------------
>
> Key: ARROW-8017
> URL: https://issues.apache.org/jira/browse/ARROW-8017
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Environment: Conda :
> arrow-cpp 0.15.1 py38h7cd5009_5
> numba 0.48.0 py38h0573a6f_0
> numpy 1.18.1 py38h4f9e942_0
> numpy-base 1.18.1 py38hde5b4d6_1
> pandas 1.0.1 py38h0573a6f_0
> pyarrow 0.15.1 py38h0573a6f_0
> pycparser 2.19 py_0
> python 3.8.1 h0371630_1
> python-dateutil 2.8.1 py_0
> Reporter: Iemand
> Priority: Minor
> Labels: features
>
> Trying to store a table with Python's pathlib Path will give an ArrowInvalid:
> {{ArrowInvalid: ('Could not convert foo/spam.wav with type PosixPath: did not
> recognize Python value type when inferring an Arrow data type', 'Conversion
> failed for column filepath with type object')}}
> {{Pandas approach:}}
> {code:python}
> from pathlib import Path
> import pandas as pd
> df_test = pd.DataFrame({"filepath": [Path("foo", "spam.wav")]})
> df_test.to_parquet("egg.parquet")
> {code}
>
> {{Parquet approach}}
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.Table.from_pandas(df_test) # fails here
> # pq.write_table(table, 'egg.parquet') # , version='2.0'
> {code}
>
> {{Full error Traceback of }}{{pa.Table.from_pandas}}
> {code:python}
> ---------------------------------------------------------------------------
> ArrowInvalid Traceback (most recent call last)
> <ipython-input-220-bce69439945e> in <module>
> 2 import pyarrow.parquet as pq
> 3
> ----> 4 table = pa.Table.from_pandas(df_test)
> 5 pq.write_table(table, 'egg.parquet', version='2.0')
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/table.pxi in
> pyarrow.lib.Table.from_pandas()
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/pandas_compat.py
> in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 552
> 553 if nthreads == 1:
> --> 554 arrays = [convert_column(c, f)
> 555 for c, f in zip(columns_to_convert, convert_fields)]
> 556 else:
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/pandas_compat.py
> in <listcomp>(.0)
> 552
> 553 if nthreads == 1:
> --> 554 arrays = [convert_column(c, f)
> 555 for c, f in zip(columns_to_convert, convert_fields)]
> 556 else:
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/pandas_compat.py
> in convert_column(col, field)
> 544 e.args += ("Conversion failed for column {0!s} with type
> {1!s}"
> 545 .format(col.name, col.dtype),)
> --> 546 raise e
> 547 if not field_nullable and result.null_count > 0:
> 548 raise ValueError("Field {} was non-nullable but pandas
> column "
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/pandas_compat.py
> in convert_column(col, field)
> 538
> 539 try:
> --> 540 result = pa.array(col, type=type_, from_pandas=True,
> safe=safe)
> 541 except (pa.ArrowInvalid,
> 542 pa.ArrowNotImplementedError,
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/array.pxi in
> pyarrow.lib.array()
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/error.pxi in
> pyarrow.lib.check_status()
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/error.pxi in
> pyarrow.lib.check_status()
> ArrowInvalid: ('Could not convert foo/spam.wav with type PosixPath: did not
> recognize Python value type when inferring an Arrow data type', 'Conversion
> failed for column filepath with type object')
> {code}
> Might be related to https://issues.apache.org/jira/browse/ARROW-2046 ,
> although that was about file save location.