[
https://issues.apache.org/jira/browse/ARROW-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358424#comment-17358424
]
Karik Isichei commented on ARROW-12096:
---------------------------------------
I've created a PR for the fix (C++ side only).
[https://github.com/apache/arrow/pull/10461]
Let me know if there are any problems or suggested improvements. I considered
covering both C++ and Python, but thought it better to fix the C++
functionality first and then open a follow-up PR to expose it in Python.
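For context on the failure mode: the bad values in the report are exactly what you get by reinterpreting the true nanosecond count as a signed 64-bit integer. A stdlib-only sketch (helper names are mine, for illustration):

```python
import datetime

INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)

def true_ns_since_epoch(dt):
    # Exact nanoseconds since the Unix epoch; Python ints never overflow.
    delta = dt - EPOCH
    return (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000

def wrap_to_int64(value):
    # Reinterpret an arbitrary integer as a signed 64-bit value
    # (two's-complement wraparound), mimicking the C++ overflow.
    wrapped = value % 2**64
    return wrapped - 2**64 if wrapped > INT64_MAX else wrapped

ns = true_ns_since_epoch(datetime.datetime(3000, 1, 1))
assert ns > INT64_MAX            # out of range for timestamp[ns]
assert ns // 1000 <= INT64_MAX   # but fine for timestamp[us]

corrupted = EPOCH + datetime.timedelta(microseconds=wrap_to_int64(ns) // 1000)
# corrupted is 1830-11-23 00:50:52.580896 -- the bogus value from the
# report, truncated here to microsecond precision.
```

The same wraparound arithmetic reproduces the 2169-02-08 value for datetime(1000, 1, 1) from the example below.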
> [Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when
> converting to Arrow Array (timestamp[ns])
> --------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-12096
> URL: https://issues.apache.org/jira/browse/ARROW-12096
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 2.0.0, 3.0.0
> Environment: macos mojave 10.14.6
> Python 3.8.3
> pyarrow 3.0.0
> pandas 1.2.3
> Reporter: Karik Isichei
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When reading Parquet data with timestamps stored as INT96, pyarrow assumes
> the timestamp type should be nanoseconds. Conversion to an Arrow table then
> overflows if the Parquet column stores values that are out of bounds for a
> nanosecond-resolution timestamp.
> {code:python}
> # Round Trip Example
> import datetime
> import pandas as pd
> import pyarrow as pa
> from pyarrow import parquet as pq
> df = pd.DataFrame({"a": [datetime.datetime(1000,1,1),
> datetime.datetime(2000,1,1), datetime.datetime(3000,1,1)]})
> a_df = pa.Table.from_pandas(df)
> a_df.schema # a: timestamp[us]
> pq.write_table(a_df, "test_round_trip.parquet",
> use_deprecated_int96_timestamps=True, version="1.0")
> pfile = pq.ParquetFile("test_round_trip.parquet")
> pfile.schema_arrow # a: timestamp[ns]
> pq.read_table("test_round_trip.parquet").to_pandas()
> # Results in values:
> # 2169-02-08 23:09:07.419103232
> # 2000-01-01 00:00:00
> # 1830-11-23 00:50:52.580896768
> {code}
> The example above demonstrates the bug by having pyarrow write a Parquet
> file in a similar state to the original file where the bug was discovered.
> The bug was originally found when reading Parquet output from Amazon Athena
> with pyarrow, where we cannot control the output format of the Parquet files
> [Context|https://github.com/awslabs/aws-data-wrangler/issues/592].
> I found some existing issues that might also be related:
> * [ARROW-10444|https://issues.apache.org/jira/browse/ARROW-10444]
> * [ARROW-6779|https://issues.apache.org/jira/browse/ARROW-6779] (this shows
> a similar symptom, although testing it on pyarrow v3 raises an out-of-bounds
> error instead)
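Worth noting why the file itself is fine: INT96 stores the day and the time of day separately (8 little-endian bytes of nanoseconds within the day, followed by 4 bytes of Julian day number), so any date is representable; the overflow only happens when a reader collapses both fields into a single int64 nanosecond count. A decoding sketch (the helper name is mine, for illustration):

```python
import datetime
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01

def decode_int96_timestamp(raw: bytes) -> datetime.datetime:
    # Parquet INT96 layout: 8 little-endian bytes of nanoseconds within
    # the day, then 4 little-endian bytes of Julian day number.
    nanos_in_day, julian_day = struct.unpack("<qI", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return datetime.datetime(1970, 1, 1) + datetime.timedelta(
        days=days, microseconds=nanos_in_day // 1000
    )

# datetime(3000, 1, 1) is 376200 days after the epoch -- easily
# representable here, even though it overflows a single int64 of ns.
raw = struct.pack("<qI", 0, JULIAN_DAY_OF_UNIX_EPOCH + 376200)
assert decode_int96_timestamp(raw) == datetime.datetime(3000, 1, 1)
```

The fix in the linked PR targets the point where this decoded value is coerced to an Arrow timestamp unit, rather than the INT96 decoding itself.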
--
This message was sent by Atlassian Jira
(v8.3.4#803005)