[ 
https://issues.apache.org/jira/browse/ARROW-16348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Wujciak-Jens updated ARROW-16348:
---------------------------------------
    Summary: [Python] ParquetWriter use_compliant_nested_type=True does not 
preserve ExtensionArray when reading back  (was: ParquetWriter 
use_compliant_nested_type=True does not preserve ExtensionArray when reading 
back)

> [Python] ParquetWriter use_compliant_nested_type=True does not preserve 
> ExtensionArray when reading back
> --------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16348
>                 URL: https://issues.apache.org/jira/browse/ARROW-16348
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>         Environment: pyarrow 7.0.0 installed via pip.
>            Reporter: Jim Pivarski
>            Priority: Major
>              Labels: Parquet, parquetReader
>
> I've been happily making ExtensionArrays, but recently noticed that they 
> aren't preserved by round-trips through Parquet files when 
> {{use_compliant_nested_type=True}}.
> Consider this writer.py:
>  
> {code:python}
> import json
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> class AnnotatedType(pa.ExtensionType):
>     def __init__(self, storage_type, annotation):
>         self.annotation = annotation
>         super().__init__(storage_type, "my:app")
>     def __arrow_ext_serialize__(self):
>         return json.dumps(self.annotation).encode()
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         annotation = json.loads(serialized.decode())
>         return cls(storage_type, annotation)
>     @property
>     def num_buffers(self):
>         return self.storage_type.num_buffers
>     @property
>     def num_fields(self):
>         return self.storage_type.num_fields
> pa.register_extension_type(AnnotatedType(pa.null(), None))
> array = pa.Array.from_buffers(
>     AnnotatedType(pa.list_(pa.float64()), {"cool": "beans"}),
>     3,
>     [None, pa.py_buffer(np.array([0, 3, 3, 5], np.int32))],
>     children=[pa.array([1.1, 2.2, 3.3, 4.4, 5.5])],
> )
> table = pa.table({"": array})
> print(table)
> pq.write_table(table, "tmp.parquet", use_compliant_nested_type=True)
> {code}
> And this reader.py:
>  
> {code:python}
> import json
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> class AnnotatedType(pa.ExtensionType):
>     def __init__(self, storage_type, annotation):
>         self.annotation = annotation
>         super().__init__(storage_type, "my:app")
>     def __arrow_ext_serialize__(self):
>         return json.dumps(self.annotation).encode()
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         annotation = json.loads(serialized.decode())
>         return cls(storage_type, annotation)
>     @property
>     def num_buffers(self):
>         return self.storage_type.num_buffers
>     @property
>     def num_fields(self):
>         return self.storage_type.num_fields
> pa.register_extension_type(AnnotatedType(pa.null(), None))
> table = pq.read_table("tmp.parquet")
> print(table)
> {code}
> (The AnnotatedType is the same; I wrote it twice for explicitness.)
> When the writer.py has {{use_compliant_nested_type=False}}, the output is
> {code:java}
> % python writer.py 
> pyarrow.Table
> : extension<my:app<AnnotatedType>>
> ----
> : [[[1.1,2.2,3.3],[],[4.4,5.5]]]
> % python reader.py 
> pyarrow.Table
> : extension<my:app<AnnotatedType>>
> ----
> : [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
> In other words, the AnnotatedType is preserved. When 
> {{use_compliant_nested_type=True}}, however,
> {code:java}
> % rm tmp.parquet
> rm: remove regular file 'tmp.parquet'? y
> % python writer.py 
> pyarrow.Table
> : extension<my:app<AnnotatedType>>
> ----
> : [[[1.1,2.2,3.3],[],[4.4,5.5]]]
> % python reader.py 
> pyarrow.Table
> : list<element: double>
>   child 0, element: double
> ----
> : [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
> The issue doesn't seem to be in the writing, but in the reading: regardless 
> of whether {{use_compliant_nested_type}} is {{True}} or {{False}}, I can 
> see the extension metadata in the Parquet → Arrow converted schema.
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
> : list<item: double>
>   child 0, item: double
>   -- field metadata --
>   ARROW:extension:metadata: '{"cool": "beans"}'
>   ARROW:extension:name: 'my:app'{code}
> versus, when the file was written with {{use_compliant_nested_type=True}},
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
> : list<element: double>
>   child 0, element: double
>   -- field metadata --
>   ARROW:extension:metadata: '{"cool": "beans"}'
>   ARROW:extension:name: 'my:app'{code}
> Note that the first has "{{item: double}}" and the second has 
> "{{element: double}}".
> (I'm also rather surprised that {{use_compliant_nested_type=False}} is an 
> option. Wouldn't you want the Parquet files to always be written with 
> compliant lists? I noticed this when I was having trouble getting the data 
> into BigQuery.)
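
Since the extension metadata shows up in both schema dumps above, the annotation itself can still be recovered by hand even when the extension type is not restored. The following is only a minimal sketch of that decoding step, not a fix for the reader: `extension_info` is a hypothetical helper, and the input is assumed to be a dict with bytes keys and values, which is the form pyarrow's `Field.metadata` returns. The sample dict is copied from the schema output in the report, and the JSON decoding mirrors `__arrow_ext_deserialize__` in `AnnotatedType`.

```python
import json

# Metadata keys as they appear in the schema dumps in the report.
EXT_NAME_KEY = b"ARROW:extension:name"
EXT_META_KEY = b"ARROW:extension:metadata"


def extension_info(field_metadata):
    """Return (extension_name, annotation), or None if no extension is recorded.

    Hypothetical helper: `field_metadata` is assumed to be a bytes-to-bytes
    mapping (or None), like the value of a pyarrow Field's `metadata`.
    """
    if not field_metadata or EXT_NAME_KEY not in field_metadata:
        return None
    name = field_metadata[EXT_NAME_KEY].decode()
    # AnnotatedType serializes its annotation as JSON, so decode it the same way.
    serialized = field_metadata.get(EXT_META_KEY, b"null")
    annotation = json.loads(serialized.decode())
    return name, annotation


# Hand-written stand-in for the field metadata shown in the report.
metadata = {
    EXT_NAME_KEY: b"my:app",
    EXT_META_KEY: b'{"cool": "beans"}',
}
print(extension_info(metadata))
```

With a real file, the same dict would presumably come from the field of the schema returned by `to_arrow_schema()`, as in the interactive session above. Whether the table returned by `pq.read_table` still carries this metadata is not something the report shows.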



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
