[
https://issues.apache.org/jira/browse/ARROW-16348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jacob Wujciak-Jens updated ARROW-16348:
---------------------------------------
Summary: [Python] ParquetWriter use_compliant_nested_type=True does not
preserve ExtensionArray when reading back (was: ParquetWriter
use_compliant_nested_type=True does not preserve ExtensionArray when reading
back)
> [Python] ParquetWriter use_compliant_nested_type=True does not preserve
> ExtensionArray when reading back
> --------------------------------------------------------------------------------------------------------
>
> Key: ARROW-16348
> URL: https://issues.apache.org/jira/browse/ARROW-16348
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0
> Environment: pyarrow 7.0.0 installed via pip.
> Reporter: Jim Pivarski
> Priority: Major
> Labels: Parquet, parquetReader
>
> I've been happily making ExtensionArrays, but recently noticed that they
> aren't preserved by round-trips through Parquet files when
> {{use_compliant_nested_type=True}}.
> Consider this writer.py:
>
> {code:python}
> import json
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> class AnnotatedType(pa.ExtensionType):
>     def __init__(self, storage_type, annotation):
>         self.annotation = annotation
>         super().__init__(storage_type, "my:app")
>
>     def __arrow_ext_serialize__(self):
>         return json.dumps(self.annotation).encode()
>
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         annotation = json.loads(serialized.decode())
>         return cls(storage_type, annotation)
>
>     @property
>     def num_buffers(self):
>         return self.storage_type.num_buffers
>
>     @property
>     def num_fields(self):
>         return self.storage_type.num_fields
>
> pa.register_extension_type(AnnotatedType(pa.null(), None))
>
> array = pa.Array.from_buffers(
>     AnnotatedType(pa.list_(pa.float64()), {"cool": "beans"}),
>     3,
>     [None, pa.py_buffer(np.array([0, 3, 3, 5], np.int32))],
>     children=[pa.array([1.1, 2.2, 3.3, 4.4, 5.5])],
> )
>
> table = pa.table({"": array})
> print(table)
> pq.write_table(table, "tmp.parquet", use_compliant_nested_type=True)
> {code}
> And this reader.py:
>
> {code:python}
> import json
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> class AnnotatedType(pa.ExtensionType):
>     def __init__(self, storage_type, annotation):
>         self.annotation = annotation
>         super().__init__(storage_type, "my:app")
>
>     def __arrow_ext_serialize__(self):
>         return json.dumps(self.annotation).encode()
>
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         annotation = json.loads(serialized.decode())
>         return cls(storage_type, annotation)
>
>     @property
>     def num_buffers(self):
>         return self.storage_type.num_buffers
>
>     @property
>     def num_fields(self):
>         return self.storage_type.num_fields
>
> pa.register_extension_type(AnnotatedType(pa.null(), None))
>
> table = pq.read_table("tmp.parquet")
> print(table)
> {code}
> (The AnnotatedType is the same; I wrote it twice for explicitness.)
> When writer.py has {{use_compliant_nested_type=False}}, the output is
> {code}
> % python writer.py
> pyarrow.Table
> : extension<my:app<AnnotatedType>>
> ----
> : [[[1.1,2.2,3.3],[],[4.4,5.5]]]
> % python reader.py
> pyarrow.Table
> : extension<my:app<AnnotatedType>>
> ----
> : [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
> In other words, the AnnotatedType is preserved. When
> {{use_compliant_nested_type=True}}, however,
> {code}
> % rm tmp.parquet
> rm: remove regular file 'tmp.parquet'? y
> % python writer.py
> pyarrow.Table
> : extension<my:app<AnnotatedType>>
> ----
> : [[[1.1,2.2,3.3],[],[4.4,5.5]]]
> % python reader.py
> pyarrow.Table
> : list<element: double>
>   child 0, element: double
> ----
> : [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
> The issue doesn't seem to be in the writing, but in the reading: regardless
> of whether {{use_compliant_nested_type}} is {{True}} or {{False}}, I can
> see the extension metadata in the Parquet → Arrow converted schema.
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
> : list<item: double>
>   child 0, item: double
>   -- field metadata --
>   ARROW:extension:metadata: '{"cool": "beans"}'
>   ARROW:extension:name: 'my:app'{code}
> versus
> {code:python}
> >>> import pyarrow.parquet as pq
> >>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
> : list<element: double>
>   child 0, element: double
>   -- field metadata --
>   ARROW:extension:metadata: '{"cool": "beans"}'
>   ARROW:extension:name: 'my:app'{code}
> Note that the first has "{{item: double}}" and the second has
> "{{element: double}}".
> (I'm also rather surprised that {{use_compliant_nested_type=False}} is an
> option. Wouldn't you want the Parquet files to always be written with
> compliant lists? I noticed this when I was having trouble getting the data
> into BigQuery.)
--
This message was sent by Atlassian Jira
(v8.20.7#820007)