[
https://issues.apache.org/jira/browse/ARROW-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney reassigned ARROW-5878:
-----------------------------------
Assignee: Benjamin Kietzman
> [Python][C++] Parquet reader not forward compatible for timestamps without
> timezone
> -----------------------------------------------------------------------------------
>
> Key: ARROW-5878
> URL: https://issues.apache.org/jira/browse/ARROW-5878
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.14.0
> Reporter: Florian Jetter
> Assignee: Benjamin Kietzman
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0, 0.14.1
>
> Attachments: timezones_pyarrow_14.paquet
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Timestamps without a timezone that are written by pyarrow 0.14.0 can no
> longer be read as timestamps by earlier versions. When such a file is read
> with pyarrow 0.13.0, the timestamp column comes back as an integer.
> Looking at the Parquet schemas, it seems that the older versions do not
> understand the logical type written by 0.14.0; see below.
> h4. File generation with pyarrow 0.14.0
> {code:python}
> import datetime
> import pandas as pd
> import pyarrow as pa  # needed for pa.Table below
> import pyarrow.parquet as pq
>
> df = pd.DataFrame(
>     {
>         # Timezone-naive column
>         "datetime64": pd.Series(["2018-01-01"], dtype="datetime64[ns]"),
>         # Timezone-aware column; the dtype must carry the timezone
>         "datetime64_ts": pd.Series(
>             [pd.Timestamp(datetime.datetime(2018, 1, 1), tz="Europe/Berlin")],
>             dtype="datetime64[ns, Europe/Berlin]",
>         ),
>     }
> )
> pq.write_table(pa.Table.from_pandas(df), "timezones_pyarrow_14.paquet")
> {code}
> h4. Reading with pyarrow 0.13.0
> {code:python}
> In [1]: import pyarrow.parquet as pq
> In [2]: import pyarrow as pa
> In [3]: with open("timezones_pyarrow_14.paquet", "rb") as fd:
>    ...:     table = pq.read_pandas(fd)
>    ...:
> In [4]: table.to_pandas()
> Out[4]:
>          datetime64             datetime64_ts
> 0  1514764800000000 2018-01-01 00:00:00+01:00
> In [5]: table.to_pandas().dtypes
> Out[5]:
> datetime64                               int64
> datetime64_ts    datetime64[ns, Europe/Berlin]
> dtype: object
> {code}
> h3. Parquet schema as seen by pyarrow versions:
> pyarrow 0.13.0 parquet schema
> {code:java}
> datetime64: INT64
> datetime64_ts: INT64 TIMESTAMP_MICROS
> {code}
> pyarrow 0.14.0 parquet schema
> {code:java}
> datetime64: INT64 Timestamp(isAdjustedToUTC=false, timeUnit=microseconds)
> datetime64_ts: INT64 Timestamp(isAdjustedToUTC=true, timeUnit=microseconds)
> {code}
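
As a small illustration of what the older reader hands back: the 0.14.0 schema above reports timeUnit=microseconds, so the integer shown for the datetime64 column can be read as microseconds since the Unix epoch. The sketch below uses only the standard library and decodes the value from the 0.13.0 session. The helper name micros_to_naive_datetime is hypothetical and is not part of pyarrow or of the fix for this issue.

```python
from datetime import datetime, timedelta

# Raw value that pyarrow 0.13.0 returned for the "datetime64" column above.
RAW_MICROS = 1514764800000000


def micros_to_naive_datetime(value: int) -> datetime:
    """Interpret an integer as microseconds since 1970-01-01, without a timezone.

    Hypothetical helper for illustration only.
    """
    return datetime(1970, 1, 1) + timedelta(microseconds=value)


decoded = micros_to_naive_datetime(RAW_MICROS)
print(decoded)  # 2018-01-01 00:00:00
```

On a whole pandas column, the equivalent manual conversion would presumably be pd.to_datetime(df["datetime64"], unit="us"). This only works around the symptom on the reading side and assumes the column is known to hold microsecond timestamps.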