Joris Van den Bossche created ARROW-15711:
---------------------------------------------
Summary: [C++][Parquet] Extension types with nanosecond timestamp
resolution don't roundtrip
Key: ARROW-15711
URL: https://issues.apache.org/jira/browse/ARROW-15711
Project: Apache Arrow
Issue Type: Bug
Components: C++, Parquet
Reporter: Joris Van den Bossche
Example code:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

class MyTimestampType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.timestamp("ns"))

    def __reduce__(self):
        return MyTimestampType, ()

arr = MyTimestampType().wrap_array(
    pa.array([1000, 2000, 3000], pa.timestamp("ns"))
)
table = pa.table({"col": arr})
{code}
{code}
>>> table.schema
col: extension<arrow.py_extension_type<MyTimestampType>>
>>> pq.write_table(table, "test_parquet_extension_type_timestamp_ns.parquet")
>>> result = pq.read_table("test_parquet_extension_type_timestamp_ns.parquet")
>>> result.schema
col: timestamp[us]
{code}
This happens because we only restore the extension type if the inferred
storage type (inferred from Parquet, after applying any updates based on the
Arrow schema) exactly equals the original storage type (as stored in the
Arrow schema):
https://github.com/apache/arrow/blob/afaa92e7e4289d6e4f302cc91810368794e8092b/cpp/src/parquet/arrow/schema.cc#L973-L977
With the default options, a timestamp with nanosecond resolution is stored
with microsecond resolution in Parquet, and we do not restore the original
resolution when updating the read types based on the stored Arrow schema
(e.g. we do add a timezone, but we don't change the resolution).
An additional issue is that _if_ you lose the extension type, the field
metadata about the extension type is also lost. I think that if we cannot
restore the extension type, we should at least try to keep the ARROW:extension
field metadata as information.