[ https://issues.apache.org/jira/browse/ARROW-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-15826:
------------------------------------------
Summary: [Python] Allow serializing arbitrary Python objects to parquet
(was: Allow serializing arbitrary Python objects to parquet)
> [Python] Allow serializing arbitrary Python objects to parquet
> --------------------------------------------------------------
>
> Key: ARROW-15826
> URL: https://issues.apache.org/jira/browse/ARROW-15826
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Parquet, Python
> Reporter: Michael Milton
> Priority: Major
>
> I'm trying to serialize a pandas DataFrame containing custom objects to
> parquet. Here is some example code:
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> class Foo:
>     pass
>
> df = pd.DataFrame({"a": [Foo(), Foo(), Foo()], "b": [1, 2, 3]})
> table = pa.Table.from_pandas(df)
> {code}
> Gives me:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 1782, in pyarrow.lib.Table.from_pandas
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
>     arrays = [convert_column(c, f)
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
>     arrays = [convert_column(c, f)
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
>     raise e
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
>     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert <__main__.Foo object at 0x7fc23e38bfd0> with type Foo: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column a with type object')
> {code}
> Now, I realise that there's a disclaimer about arbitrary object serialization:
> [https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization].
> However, it isn't clear how this applies to Parquet. In my case, I want a
> well-formed Parquet file with binary blobs in one column that _can_ be
> deserialized to my class, but that can otherwise be read by general Parquet
> tools without failing. Pickle alone doesn't solve this use case, since other
> languages such as R may not be able to read the pickled data.
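> To make that concrete, here is a rough sketch of the file layout I'm after. Pickle is only a stand-in serializer here (in practice I'd want a cross-language encoding), and the file name is made up:
> {code:python}
> import pickle
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> class Foo:
>     pass
>
>
> # Encode each object to bytes by hand; pickle is just a placeholder encoding.
> df = pd.DataFrame({"a": [pickle.dumps(Foo()) for _ in range(3)], "b": [1, 2, 3]})
>
> # Column "a" becomes a plain binary column, so any Parquet reader can load the
> # file; only Python code that knows the encoding recovers the Foo objects.
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "blobs.parquet")
>
> blobs = pq.read_table("blobs.parquet").column("a").to_pylist()
> objs = [pickle.loads(b) for b in blobs]
> {code}
> The downside is that nothing in the file records how column "a" should be decoded, which is what pushes me towards a proper protocol.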
> Alternatively, if there is a well-defined protocol for telling pyarrow how to
> translate a given Python type to and from Arrow types, I would be happy to use
> that instead.
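> I wonder whether pyarrow's extension type mechanism ({{pa.ExtensionType}}) is meant to be that protocol. Below is a rough sketch of how I imagine it would look, assuming the extension metadata survives the Parquet round trip; the {{FooType}} class and the {{example.foo}} name are made up for illustration:
> {code:python}
> import pickle
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> class Foo:
>     pass
>
>
> # Hypothetical extension type whose storage is plain binary, so generic
> # Parquet readers simply see a bytes column named "a".
> class FooType(pa.ExtensionType):
>     def __init__(self):
>         super().__init__(pa.binary(), "example.foo")
>
>     def __arrow_ext_serialize__(self):
>         # No parameters to persist for this type.
>         return b""
>
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         return cls()
>
>
> pa.register_extension_type(FooType())
>
> # Wrap pickled objects (placeholder serializer) in the extension type.
> storage = pa.array([pickle.dumps(Foo()) for _ in range(3)], type=pa.binary())
> foo_array = pa.ExtensionArray.from_storage(FooType(), storage)
>
> table = pa.table({"a": foo_array, "b": [1, 2, 3]})
> pq.write_table(table, "foo.parquet")
>
> # With FooType registered, pyarrow should restore the extension type on read;
> # other tools read column "a" as ordinary binary values.
> print(pq.read_table("foo.parquet").schema.field("a").type)
> {code}
> If that is the intended route, a pointer to documentation on how the extension name and serialized metadata are preserved in Parquet would already answer most of this issue.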
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)