[ 
https://issues.apache.org/jira/browse/ARROW-8017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403079#comment-17403079
 ] 

Joris Van den Bossche edited comment on ARROW-8017 at 8/23/21, 9:38 AM:
------------------------------------------------------------------------

A very late response, but the issue here is that pyarrow doesn't support 
converting custom Python objects (because we can't store them in one of the 
pyarrow data types). Thus in practice, we only support object dtype columns 
that contain basic numbers or strings. 

Small example with a custom class:

{code:python}
import pandas as pd
import pyarrow as pa

class MyClass:
    pass

# putting a custom class in a pandas Series works, this is stored in an
# "object" dtype column
In [16]: s = pd.Series([MyClass()])

In [17]: s
Out[17]: 
0    <__main__.MyClass object at 0x7f887ad3f850>
dtype: object

# but converting such object dtype data to pyarrow isn't supported
In [18]: pa.array(s)
...
ArrowInvalid: Could not convert <__main__.MyClass object at 0x7f887ad3f850> 
with type MyClass: did not recognize Python value type when inferring an Arrow 
data type
{code}

The same happens here with the {{pathlib.Path}} object in your example.



> [Python] Pyarrow no support for pathlib Path with table = 
> pa.Table.from_pandas() or pd.to_parquet()
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8017
>                 URL: https://issues.apache.org/jira/browse/ARROW-8017
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Conda :
> arrow-cpp                 0.15.1           py38h7cd5009_5
> numba                     0.48.0           py38h0573a6f_0  
> numpy                     1.18.1           py38h4f9e942_0  
> numpy-base                1.18.1           py38hde5b4d6_1
> pandas                    1.0.1            py38h0573a6f_0
> pyarrow                   0.15.1           py38h0573a6f_0
> pycparser                 2.19                       py_0
> python                    3.8.1                h0371630_1  
> python-dateutil           2.8.1                      py_0
>            Reporter: Iemand
>            Priority: Minor
>              Labels: features
>
> Trying to store a table with Python's pathlib Path will give an ArrowInvalid:
> {{ArrowInvalid: ('Could not convert foo/spam.wav with type PosixPath: did not 
> recognize Python value type when inferring an Arrow data type', 'Conversion 
> failed for column filepath with type object')}}
> {{Pandas approach:}}
> {code:python}
> from pathlib import Path
> import pandas as pd
> df_test = pd.DataFrame({"filepath": [Path("foo", "spam.wav")]})
> df_test.to_parquet("egg.parquet")
> {code}
>  
> {{Parquet approach}}
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.Table.from_pandas(df_test)  # fails here
> # pq.write_table(table, 'egg.parquet') # , version='2.0'
> {code}
>  
> {{Full error Traceback of }}{{pa.Table.from_pandas}}
> {code:python}
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <ipython-input-220-bce69439945e> in <module>
>       2 import pyarrow.parquet as pq
>       3 
> ----> 4 table = pa.Table.from_pandas(df_test)
>       5 pq.write_table(table, 'egg.parquet', version='2.0')
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.Table.from_pandas()
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/pandas_compat.py
>  in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
>     552 
>     553     if nthreads == 1:
> --> 554         arrays = [convert_column(c, f)
>     555                   for c, f in zip(columns_to_convert, convert_fields)]
>     556     else:
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/pandas_compat.py
>  in <listcomp>(.0)
>     552 
>     553     if nthreads == 1:
> --> 554         arrays = [convert_column(c, f)
>     555                   for c, f in zip(columns_to_convert, convert_fields)]
>     556     else:
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/pandas_compat.py
>  in convert_column(col, field)
>     544             e.args += ("Conversion failed for column {0!s} with type 
> {1!s}"
>     545                        .format(col.name, col.dtype),)
> --> 546             raise e
>     547         if not field_nullable and result.null_count > 0:
>     548             raise ValueError("Field {} was non-nullable but pandas 
> column "
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/pandas_compat.py
>  in convert_column(col, field)
>     538 
>     539         try:
> --> 540             result = pa.array(col, type=type_, from_pandas=True, 
> safe=safe)
>     541         except (pa.ArrowInvalid,
>     542                 pa.ArrowNotImplementedError,
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/array.pxi in 
> pyarrow.lib.array()
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> ~/anaconda3/envs/soundrhythm/lib/python3.8/site-packages/pyarrow/error.pxi in 
> pyarrow.lib.check_status()
> ArrowInvalid: ('Could not convert foo/spam.wav with type PosixPath: did not 
> recognize Python value type when inferring an Arrow data type', 'Conversion 
> failed for column filepath with type object'){code}
> Might be related to https://issues.apache.org/jira/browse/ARROW-2046 , 
> although that was about file save location.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
