Karik Isichei created ARROW-12096:
-------------------------------------

             Summary: [Python][C++] Pyarrow Parquet reader overflows INT96 
timestamps when converting to Arrow Array (timestamp[ns])
                 Key: ARROW-12096
                 URL: https://issues.apache.org/jira/browse/ARROW-12096
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 3.0.0, 2.0.0
         Environment: macos mojave 10.14.6
Python 3.8.3
pyarrow 3.0.0
pandas 1.2.3
            Reporter: Karik Isichei


When reading Parquet data with timestamps stored as INT96, pyarrow assumes 
that the timestamp unit is nanoseconds. If the Parquet column holds values 
outside the range representable as timestamp[ns], those values overflow when 
the data is converted into an Arrow table. 
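
For reference, a minimal pure-Python sketch of the range involved, assuming 
timestamp[ns] is a signed 64-bit count of nanoseconds since the Unix epoch 
(roughly the years 1677 to 2262). The helper names are hypothetical and not 
part of pyarrow:

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def ns_since_epoch(dt):
    """Exact nanoseconds between the Unix epoch and a naive datetime."""
    delta = dt - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


def fits_in_timestamp_ns(dt):
    """True if dt can be stored as a signed 64-bit nanosecond count."""
    return INT64_MIN <= ns_since_epoch(dt) <= INT64_MAX


for year in (1000, 2000, 3000):
    print(year, fits_in_timestamp_ns(datetime.datetime(year, 1, 1)))
# 1000 False
# 2000 True
# 3000 False
```

Only the year-2000 value is representable in nanoseconds; the other two are 
the ones that come back wrong in the round trip.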


{code:python}
# Round Trip Example
import datetime
import pandas as pd
import pyarrow as pa
from pyarrow import parquet as pq

df = pd.DataFrame({"a": [
    datetime.datetime(1000, 1, 1),
    datetime.datetime(2000, 1, 1),
    datetime.datetime(3000, 1, 1),
]})
a_df = pa.Table.from_pandas(df)
a_df.schema # a: timestamp[us] 

pq.write_table(
    a_df,
    "test_round_trip.parquet",
    use_deprecated_int96_timestamps=True,
    version="1.0",
)
pfile = pq.ParquetFile("test_round_trip.parquet")
pfile.schema_arrow # a: timestamp[ns]
pq.read_table("test_round_trip.parquet").to_pandas()
# Results in values (only the middle one is correct):
# 2169-02-08 23:09:07.419103232
# 2000-01-01 00:00:00
# 1830-11-23 00:50:52.580896768
{code}
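
The wrong values are consistent with signed 64-bit wraparound of the 
nanosecond count. The sketch below is pure Python with hypothetical helper 
names; it is an assumption about the mechanism, not pyarrow's actual code 
path. It reproduces the first and last values from the output above:

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def wrap_int64(value):
    """Reduce an arbitrary integer to the signed 64-bit range (two's complement)."""
    return (value + 2**63) % 2**64 - 2**63


def wrapped_ns_timestamp(dt):
    """Return (datetime truncated to seconds, leftover nanoseconds) after wraparound."""
    delta = dt - EPOCH
    ns = (delta.days * 86_400 + delta.seconds) * 1_000_000_000
    ns += delta.microseconds * 1_000
    seconds, nanos = divmod(wrap_int64(ns), 1_000_000_000)
    return EPOCH + datetime.timedelta(seconds=seconds), nanos


print(wrapped_ns_timestamp(datetime.datetime(1000, 1, 1)))
# (datetime.datetime(2169, 2, 8, 23, 9, 7), 419103232)
print(wrapped_ns_timestamp(datetime.datetime(3000, 1, 1)))
# (datetime.datetime(1830, 11, 23, 0, 50, 52), 580896768)
```

The year-2000 value fits in the 64-bit range, so it passes through unchanged.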


The example above only demonstrates the bug: it gets pyarrow to write a 
Parquet file in a state similar to the original file in which the bug was 
discovered. The bug was originally found when reading Parquet output from 
Amazon Athena with pyarrow (where we can't control the output format of the 
Parquet file) 
[Context|https://github.com/awslabs/aws-data-wrangler/issues/592].

I found some existing issues that might also be related:

* [ARROW-10444|https://issues.apache.org/jira/browse/ARROW-10444] 
* [ARROW-6779|https://issues.apache.org/jira/browse/ARROW-6779] (this shows 
similar behaviour, although testing it on pyarrow v3 raises an out-of-bounds 
error)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
