[
https://issues.apache.org/jira/browse/ARROW-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500469#comment-17500469
]
Will Jones commented on ARROW-15826:
------------------------------------
{quote} I want to serialize this in a way that can be read by any other
language, and therefore pickle does _not_ suffice.
{quote}
So maybe I misunderstood earlier what you meant by serialization. My initial
impression was that you cared about saving Python object instances and didn't
care about portability. For example, some users have wanted to save a
{{requests.Response}} object in a column as part of web-scraped data; that's a
case where you have to choose between pickling to keep all the data, or
converting to a format that other languages can read. But it sounds like you
are designing the data structure yourself?
Arrow and Parquet support nested types, including struct, list, and map
columns, so if what you really care about is saving nested data, that's
already possible. With the dataclass example you gave, all you have to do is
convert the instances to dicts; PyArrow can automatically convert those into
nested Arrow types, and any Parquet reader will be able to understand the
result.
{code:python}
from dataclasses import dataclass, asdict

import pandas as pd
import pyarrow as pa


@dataclass
class MyClass:
    a: str
    b: int
    c: bool


df = pd.DataFrame({
    "a": [MyClass("a", 1, True), MyClass("b", 2, True), MyClass("c", 3, False)],
    "b": [1, 2, 3]
})

# PyArrow already knows how to convert lists and dicts into nested types
df['a'] = [asdict(x) for x in df['a']]
pa.Table.from_pandas(df)
# pyarrow.Table
# a: struct<a: string, b: int64, c: bool>
# child 0, a: string
# child 1, b: int64
# child 2, c: bool
# b: int64
# ----
# a: [ -- is_valid: all not null -- child 0 type: string
# [
# "a",
# "b",
# "c"
# ] -- child 1 type: int64
# [
# 1,
# 2,
# 3
# ] -- child 2 type: bool
# [
# true,
# true,
# false
# ]]
# b: [[1,2,3]]
{code}
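To make the round trip concrete, here is a sketch that writes such a table to
Parquet and reads it back into dataclass instances (I'm writing to an
in-memory buffer here, but a file path works the same way):

{code:python}
from dataclasses import dataclass, asdict

import pyarrow as pa
import pyarrow.parquet as pq


@dataclass
class MyClass:
    a: str
    b: int
    c: bool


table = pa.Table.from_pydict({
    "a": [asdict(MyClass("a", 1, True)), asdict(MyClass("b", 2, False))],
    "b": [1, 2],
})

# Write to Parquet; the dict column is stored as an ordinary struct column
buf = pa.BufferOutputStream()
pq.write_table(table, buf)

# Any Parquet reader sees plain nested data; in Python we can rehydrate
# the dicts back into dataclass instances
roundtrip = pq.read_table(pa.BufferReader(buf.getvalue()))
objects = [MyClass(**d) for d in roundtrip.column("a").to_pylist()]
{code}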
> Allow serializing arbitrary Python objects to parquet
> -----------------------------------------------------
>
> Key: ARROW-15826
> URL: https://issues.apache.org/jira/browse/ARROW-15826
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Parquet, Python
> Reporter: Michael Milton
> Priority: Major
>
> I'm trying to serialize a pandas DataFrame containing custom objects to
> parquet. Here is some example code:
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> class Foo:
>     pass
>
> df = pd.DataFrame({"a": [Foo(), Foo(), Foo()], "b": [1, 2, 3]})
> table = pa.Table.from_pandas(df)
> {code}
> Gives me:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 1782, in pyarrow.lib.Table.from_pandas
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
>     arrays = [convert_column(c, f)
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
>     arrays = [convert_column(c, f)
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
>     raise e
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
>     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert <__main__.Foo object at 0x7fc23e38bfd0> with type Foo: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column a with type object')
> {code}
> Now, I realise that there's this disclaimer about arbitrary object
> serialization:
> [https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization].
> However, it isn't clear how this applies to Parquet. In my case, I want a
> well-formed Parquet file with binary blobs in one column that _can_ be
> deserialized to my class, but can otherwise be read by general Parquet tools
> without failing. Using pickle doesn't solve this use case, since other
> languages like R may not be able to read the pickled data.
> Alternatively, if there is a well-defined protocol for telling pyarrow how to
> translate a given type to and from arrow types, I would be happy to use that
> instead.
>
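As for a well-defined protocol for translating a custom type to and from Arrow
types: PyArrow's extension-type mechanism is the closest fit. A minimal sketch
(the extension name {{example.my_class}} and the struct layout here are
illustrative choices, not an official convention); readers that don't know the
extension type still see a plain struct column:

{code:python}
import pyarrow as pa


class MyClassType(pa.ExtensionType):
    """Tags a struct column so Python readers can recognize it, while
    other Arrow/Parquet readers still see ordinary nested data."""

    def __init__(self):
        storage = pa.struct([
            ("a", pa.string()), ("b", pa.int64()), ("c", pa.bool_()),
        ])
        super().__init__(storage, "example.my_class")

    def __arrow_ext_serialize__(self):
        return b""  # no extra parameters to persist

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


# Register so the type survives IPC / Parquet round trips in this process
pa.register_extension_type(MyClassType())

storage = pa.array([{"a": "x", "b": 1, "c": True}],
                   type=MyClassType().storage_type)
arr = pa.ExtensionArray.from_storage(MyClassType(), storage)
{code}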
--
This message was sent by Atlassian Jira
(v8.20.1#820001)