[jira] [Commented] (ARROW-12096) [Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when converting to Arrow Array (timestamp[ns])

Karik Isichei (Jira) Fri, 30 Apr 2021 12:55:15 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337599#comment-17337599
 ]


Karik Isichei commented on ARROW-12096:
---------------------------------------

Thanks [~apitrou] for pointing me in the right direction, appreciate it!

> [Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when 
> converting to Arrow Array (timestamp[ns])
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12096
>                 URL: https://issues.apache.org/jira/browse/ARROW-12096
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 2.0.0, 3.0.0
>         Environment: macos mojave 10.14.6
> Python 3.8.3
> pyarrow 3.0.0
> pandas 1.2.3
>            Reporter: Karik Isichei
>            Priority: Major
>
> When reading Parquet data whose timestamps are stored as INT96, pyarrow 
> assumes the timestamp unit is nanoseconds. Converting the column to an Arrow 
> table then silently overflows for any stored value that falls outside the 
> range representable as timestamp[ns].
> {code:python}
> # Round-trip example
> import datetime
> import pandas as pd
> import pyarrow as pa
> from pyarrow import parquet as pq
>
> df = pd.DataFrame({"a": [
>     datetime.datetime(1000, 1, 1),
>     datetime.datetime(2000, 1, 1),
>     datetime.datetime(3000, 1, 1),
> ]})
> a_df = pa.Table.from_pandas(df)
> a_df.schema  # a: timestamp[us]
> pq.write_table(
>     a_df,
>     "test_round_trip.parquet",
>     use_deprecated_int96_timestamps=True,
>     version="1.0",
> )
> pfile = pq.ParquetFile("test_round_trip.parquet")
> pfile.schema_arrow  # a: timestamp[ns]
> pq.read_table("test_round_trip.parquet").to_pandas()
> # Results in values:
> # 2169-02-08 23:09:07.419103232
> # 2000-01-01 00:00:00
> # 1830-11-23 00:50:52.580896768
> {code}
> The example above only demonstrates the bug: it makes pyarrow write a Parquet 
> file similar to the original file in which the bug was discovered. The bug 
> was originally found when reading Parquet output from Amazon Athena with 
> pyarrow, where we cannot control the output format of the Parquet file 
> ([Context|https://github.com/awslabs/aws-data-wrangler/issues/592]).
> I found some existing issues that might also be related:
> * [ARROW-10444|https://issues.apache.org/jira/browse/ARROW-10444] 
> * [ARROW-6779|https://issues.apache.org/jira/browse/ARROW-6779] (this shows 
> similar behaviour, although testing it on pyarrow v3 raises an out-of-bounds 
> error)
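
The wrong values in the round-trip example are consistent with plain 64-bit wraparound: a nanosecond count since the Unix epoch that does not fit in a signed int64 wraps modulo 2**64. The sketch below reproduces that arithmetic using only the standard library. It is an illustration of the overflow, not pyarrow's actual conversion code, and the helper names `wrap_int64` and `as_overflowed_ns` are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def wrap_int64(value):
    """Reinterpret an arbitrary Python int as a two's-complement int64."""
    return (value + 2**63) % 2**64 - 2**63


def as_overflowed_ns(dt):
    """Hypothetical helper: the datetime obtained after squeezing `dt`
    into an int64 count of nanoseconds since the Unix epoch."""
    delta = dt - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    nanos = micros * 1_000
    wrapped = wrap_int64(nanos)
    # datetime only resolves microseconds, so the last three digits are dropped
    # (floor division keeps the result consistent for negative values).
    return EPOCH + timedelta(microseconds=wrapped // 1_000)


for year in (1000, 2000, 3000):
    print(year, "->", as_overflowed_ns(datetime(year, 1, 1)))
```

Up to the microsecond truncation, the years 1000 and 3000 land on the same dates and times reported in the issue (2169-02-08 and 1830-11-23), while the year 2000 is within range and passes through unchanged.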



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
