[
https://issues.apache.org/jira/browse/ARROW-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882313#comment-16882313
]
TP Boudreau edited comment on ARROW-5889 at 7/10/19 6:03 PM:
-------------------------------------------------------------
I can think of two possible approaches to correcting this on an interim basis
before the parquet.thrift gets changed (if it does get changed), but neither is
perfect:
1. Add a new boolean member to the parquet::TimestampLogicalType class named
fromConvertedType that is set to true if the object was constructed from a
converted type, false if the user explicitly constructed the object. While in
memory, the Arrow conversions can interrogate the property and, if "true", can
imitate the old TIMESTAMP converted type logic. On writing a schema, if the
property is "true", the writer would NOT write a TimestampLogicalType for that
field/column, but would instead write just the TIMESTAMP converted type (as it
does now already) – the original converted type semantics would be retained
both in use and on disk.
This would require changes to the recently released public API for the
TimestampLogicalType class (new creator functions, accessors, etc.). Also it
would result in a parquet file with mixed converted type and LogicalType
annotations (which seems legal, but probably wasn't intended).
2. Use file-level key-value metadata to store the fact that the field came from
a converted type (as will be done for timezones). This requires changes to the
Arrow public API (converting an Arrow schema to Parquet would produce both a
Parquet schema and a K-V metadata object). Also, given that field names are
not unique, it might be difficult to produce unique keys (knowable both on the
Arrow and Parquet sides). But both these problems will have to be addressed
eventually if timezones are to be saved this way.
I'd lean toward option (1.), but there might be gotchas that I'm not considering
for either option. Do either of these sound like they're worth pursuing? If
so, I can work on this.
> [Python][C++] Parquet backwards compat for timestamps without timezone broken
> -----------------------------------------------------------------------------
>
> Key: ARROW-5889
> URL: https://issues.apache.org/jira/browse/ARROW-5889
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.14.0
> Reporter: Florian Jetter
> Priority: Minor
> Labels: parquet
> Fix For: 0.14.1
>
> Attachments: 0.12.1.parquet, 0.13.0.parquet
>
>
> When reading a parquet file that has timestamp fields, they are read as
> timestamps with timezone UTC if the parquet file was written by pyarrow 0.13.0
> and/or 0.12.1.
> Expected behavior would be that they are loaded as timestamps without any
> timezone information.
> The attached files contain one row for all basic types and a few nested
> types, the timestamp fields are called datetime64 and datetime64_tz
> see also
> [https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat]
> [https://github.com/JDASoftwareGroup/kartothek/blob/c47e52116e2dc726a74d7d6b97922a0252722ed0/tests/serialization/test_arrow_compat.py#L31]
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)