[ 
https://issues.apache.org/jira/browse/ARROW-15711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-15711:
------------------------------------------
    Description: 
Example code:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

class MyTimestampType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.timestamp("ns"))

    def __reduce__(self):
        return MyTimestampType, ()


arr = MyTimestampType().wrap_array(
    pa.array([1000, 2000, 3000], pa.timestamp("ns"))
)
table = pa.table({"col": arr})
{code}

{code}
>>> table.schema
col: extension<arrow.py_extension_type<MyTimestampType>>

>>> pq.write_table(table, "test_parquet_extension_type_timestamp_ns.parquet")
>>> result = pq.read_table("test_parquet_extension_type_timestamp_ns.parquet")
>>> result.schema
col: timestamp[us]
{code}
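
As a side note, the roundtrip does work when writing with a Parquet format version that stores nanosecond timestamps natively (assuming a pyarrow version where {{version="2.6"}} is available), which already hints at the cause:

{code:python}
# Possible workaround (assumption: Parquet format version "2.6" is supported):
# nanosecond timestamps are then stored as-is, so the inferred storage type
# matches the original one and the extension type is restored on read.
pq.write_table(table, "test_ns.parquet", version="2.6")
pq.read_table("test_ns.parquet").schema
# col: extension<arrow.py_extension_type<MyTimestampType>>
{code}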

The reason for this is that we only restore the extension type if the inferred storage type (inferred from Parquet, after applying any updates based on the stored Arrow schema) exactly equals the original storage type (as stored in the Arrow schema):

https://github.com/apache/arrow/blob/afaa92e7e4289d6e4f302cc91810368794e8092b/cpp/src/parquet/arrow/schema.cc#L973-L977
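
In Python terms, the check that fails here roughly amounts to the following (a sketch, not the actual C++ code):

{code:python}
# The storage type inferred from the Parquet file has microsecond resolution,
# while the original storage type in the stored Arrow schema has nanoseconds.
inferred_storage = pa.timestamp("us")               # inferred from Parquet
original_storage = MyTimestampType().storage_type   # timestamp[ns]
inferred_storage.equals(original_storage)           # False -> not restored
{code}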

And, with the default options, a timestamp with nanosecond resolution gets stored with microsecond resolution in Parquet, and that is something we do not restore when updating the read types based on the stored Arrow schema (e.g. we do add a timezone, but we don't change the resolution).
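
The same asymmetry shows up for a plain (non-extension) column, assuming default write options: a non-UTC timezone (which Parquet itself cannot represent) is restored from the stored Arrow schema, while the nanosecond resolution is not:

{code:python}
# Plain timestamp column written with the default options:
plain = pa.table(
    {"col": pa.array([1000, 2000], pa.timestamp("ns", tz="Europe/Brussels"))}
)
pq.write_table(plain, "test_plain.parquet")
pq.read_table("test_plain.parquet").schema
# col: timestamp[us, tz=Europe/Brussels] -> timezone restored, resolution not
{code}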

An additional issue is that _if_ you lose the extension type, the field metadata about the extension type is also lost. I think that if we cannot restore the extension type, we should at least try to keep the ARROW:extension field metadata as information. This is also what we do for an unrecognized (unregistered) extension type.
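
Concretely, keeping that information could look like the following on the resulting field (a sketch of the desired behaviour, using the key names we already use for unrecognized extension types):

{code:python}
# Desired behaviour (sketch): keep the ARROW:extension:* keys in the field
# metadata when the extension type itself cannot be restored.
result.schema.field("col").metadata
# {b'ARROW:extension:name': b'arrow.py_extension_type',
#  b'ARROW:extension:metadata': b'...'}
# (currently this is None instead)
{code}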

> [C++][Parquet] Extension types with nanosecond timestamp resolution don't 
> roundtrip
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-15711
>                 URL: https://issues.apache.org/jira/browse/ARROW-15711
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Joris Van den Bossche
>            Priority: Major
>


