[ 
https://issues.apache.org/jira/browse/ARROW-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882313#comment-16882313
 ] 

TP Boudreau edited comment on ARROW-5889 at 7/10/19 6:04 PM:
-------------------------------------------------------------

I can think of two possible approaches to correcting this on an interim basis 
before the parquet.thrift specification gets changed (if it does get changed), 
but neither is perfect:

1.  Add a new boolean member to the parquet::TimestampLogicalType class named 
fromConvertedType that is set to true if the object was constructed from a 
converted type and false if the user constructed the object explicitly.  While in 
memory, the Arrow conversions can interrogate the property and, if "true", can 
imitate the old TIMESTAMP converted type logic.  On writing a schema, if the 
property is "true", the writer would NOT write a TimestampLogicalType for that 
field/column, but would instead write just the TIMESTAMP converted type (as it 
does now already) – the original converted type semantics would be retained 
both in use and on disk.

This would require changes to the recently released public API for the 
TimestampLogicalType class (new creator functions, accessors, etc.).  Also it 
would result in a parquet file with mixed converted type and LogicalType 
annotations (which seems legal, but probably wasn't intended).

2.  Use file-level key-value metadata to store the fact that the field came from 
a converted type (as will be done for timezones).  This requires changes to the 
Arrow public API (converting from an Arrow schema to a Parquet schema would 
produce both a Parquet schema and a K-V metadata object).  Also, given that 
field names are not unique, it might be difficult to produce unique keys 
(knowable on both the Arrow and Parquet sides).  But both these problems will 
have to be addressed eventually if timezones are to be saved this way.

I'd lean toward option (1.), but there might be gotchas that I'm not considering 
for either option.  Does either of these sound like it's worth pursuing?  If 
so, I can work on this. 



> [Python][C++] Parquet backwards compat for timestamps without timezone broken
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-5889
>                 URL: https://issues.apache.org/jira/browse/ARROW-5889
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.14.0
>            Reporter: Florian Jetter
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.14.1
>
>         Attachments: 0.12.1.parquet, 0.13.0.parquet
>
>
> When reading a parquet file which has timestamp fields they are read as a 
> timestamp with timezone UTC if the parquet file was written by pyarrow 0.13.0 
> and/or 0.12.1.
> Expected behavior would be that they are loaded as timestamps without any 
> timezone information.
> The attached files contain one row for all basic types and a few nested 
> types, the timestamp fields are called datetime64 and datetime64_tz
> see also 
> [https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat]
> [https://github.com/JDASoftwareGroup/kartothek/blob/c47e52116e2dc726a74d7d6b97922a0252722ed0/tests/serialization/test_arrow_compat.py#L31]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
