[ 
https://issues.apache.org/jira/browse/ARROW-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-15826:
------------------------------------------
    Summary: [Python] Allow serializing arbitrary Python objects to parquet  
(was: Allow serializing arbitrary Python objects to parquet)

> [Python] Allow serializing arbitrary Python objects to parquet
> --------------------------------------------------------------
>
>                 Key: ARROW-15826
>                 URL: https://issues.apache.org/jira/browse/ARROW-15826
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Parquet, Python
>            Reporter: Michael Milton
>            Priority: Major
>
> I'm trying to serialize a pandas DataFrame containing custom objects to 
> parquet. Here is some example code:
> {code:java}
> import pandas as pd
> import pyarrow as pa
> class Foo: 
>     pass
> df = pd.DataFrame({"a": [Foo(), Foo(), Foo()], "b": [1, 2, 3]})
> table = pa.Table.from_pandas(df)
> {code}
> Gives me:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 1782, in pyarrow.lib.Table.from_pandas
>   File 
> "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py",
>  line 594, in dataframe_to_arrays
>     arrays = [convert_column(c, f)
>   File 
> "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py",
>  line 594, in <listcomp>
>     arrays = [convert_column(c, f)
>   File 
> "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py",
>  line 581, in convert_column
>     raise e
>   File 
> "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py",
>  line 575, in convert_column
>     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert <__main__.Foo object at 
> 0x7fc23e38bfd0> with type Foo: did not recognize Python value type when 
> inferring an Arrow data type', 'Conversion failed for column a with type 
> object'){code}
> Now, I realise that there's this disclaimer about arbitrary object 
> serialization: 
> [https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization].
>  However, it isn't clear how this applies to parquet. In my case, I want a 
> well-formed parquet file that has binary blobs in one column that _can_ be 
> deserialized to my class, but can otherwise be read by general parquet tools 
> without failing. Pickle doesn't solve this use case, since other languages 
> such as R may not be able to read the pickled data.
> Alternatively, if there is a well-defined protocol for telling pyarrow how to 
> translate a given type to and from arrow types, I would be happy to use that 
> instead.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
