[
https://issues.apache.org/jira/browse/ARROW-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358424#comment-17358424
]
Karik Isichei commented on ARROW-12096:
---------------------------------------
I've created a PR for the fix (C++ side only).
[https://github.com/apache/arrow/pull/10461]
Let me know if there are any problems or suggested improvements. I considered
covering both C++ and Python, but thought it better to fix the C++
functionality first and then open a follow-up PR to expose it in Python.
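For context on the failure mode: the bad values in the report are exactly what you get by reinterpreting the true nanosecond count as a signed 64-bit integer. A stdlib-only sketch (helper names are mine, for illustration):

```python
import datetime

INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)

def true_ns_since_epoch(dt):
    # Exact nanoseconds since the Unix epoch; Python ints never overflow.
    delta = dt - EPOCH
    return (delta.days * 86400 + delta.seconds) * 10**9 + delta.microseconds * 1000

def wrap_to_int64(value):
    # Reinterpret an arbitrary integer as a signed 64-bit value
    # (two's-complement wraparound), mimicking the C++ overflow.
    wrapped = value % 2**64
    return wrapped - 2**64 if wrapped > INT64_MAX else wrapped

ns = true_ns_since_epoch(datetime.datetime(3000, 1, 1))
assert ns > INT64_MAX            # out of range for timestamp[ns]
assert ns // 1000 <= INT64_MAX   # but fine for timestamp[us]

corrupted = EPOCH + datetime.timedelta(microseconds=wrap_to_int64(ns) // 1000)
# corrupted is 1830-11-23 00:50:52.580896 -- the bogus value from the
# report, truncated here to microsecond precision.
```

The same wraparound arithmetic reproduces the 2169-02-08 value for datetime(1000, 1, 1) from the example below.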
> [Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when
> converting to Arrow Array (timestamp[ns])
> --------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-12096
> URL: https://issues.apache.org/jira/browse/ARROW-12096
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 2.0.0, 3.0.0
> Environment: macos mojave 10.14.6
> Python 3.8.3
> pyarrow 3.0.0
> pandas 1.2.3
> Reporter: Karik Isichei
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When reading Parquet data with timestamps stored as INT96, pyarrow assumes
> the timestamp type should be nanoseconds. Conversion to an Arrow table then
> overflows if the Parquet column stores values that are out of bounds for a
> nanosecond-resolution timestamp.
> {code:python}
> # Round Trip Example
> import datetime
> import pandas as pd
> import pyarrow as pa
> from pyarrow import parquet as pq
> df = pd.DataFrame({"a": [datetime.datetime(1000,1,1),
> datetime.datetime(2000,1,1), datetime.datetime(3000,1,1)]})
> a_df = pa.Table.from_pandas(df)
> a_df.schema # a: timestamp[us]
> pq.write_table(a_df, "test_round_trip.parquet",
> use_deprecated_int96_timestamps=True, version="1.0")
> pfile = pq.ParquetFile("test_round_trip.parquet")
> pfile.schema_arrow # a: timestamp[ns]
> pq.read_table("test_round_trip.parquet").to_pandas()
> # Results in values:
> # 2169-02-08 23:09:07.419103232
> # 2000-01-01 00:00:00
> # 1830-11-23 00:50:52.580896768
> {code}
> The example above demonstrates the bug by having pyarrow write a Parquet
> file in a similar state to the original file where the bug was discovered.
> The bug was originally found when reading Parquet output from Amazon Athena
> with pyarrow, where we cannot control the output format of the Parquet files
> [Context|https://github.com/awslabs/aws-data-wrangler/issues/592].
> I found some existing issues that might also be related:
> * [ARROW-10444|https://issues.apache.org/jira/browse/ARROW-10444]
> * [ARROW-6779|https://issues.apache.org/jira/browse/ARROW-6779] (this shows
> a similar symptom, although testing it on pyarrow v3 raises an out-of-bounds
> error instead)
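Worth noting why the file itself is fine: INT96 stores the day and the time of day separately (8 little-endian bytes of nanoseconds within the day, followed by 4 bytes of Julian day number), so any date is representable; the overflow only happens when a reader collapses both fields into a single int64 nanosecond count. A decoding sketch (the helper name is mine, for illustration):

```python
import datetime
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01

def decode_int96_timestamp(raw: bytes) -> datetime.datetime:
    # Parquet INT96 layout: 8 little-endian bytes of nanoseconds within
    # the day, then 4 little-endian bytes of Julian day number.
    nanos_in_day, julian_day = struct.unpack("<qI", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return datetime.datetime(1970, 1, 1) + datetime.timedelta(
        days=days, microseconds=nanos_in_day // 1000
    )

# datetime(3000, 1, 1) is 376200 days after the epoch -- easily
# representable here, even though it overflows a single int64 of ns.
raw = struct.pack("<qI", 0, JULIAN_DAY_OF_UNIX_EPOCH + 376200)
assert decode_int96_timestamp(raw) == datetime.datetime(3000, 1, 1)
```

The fix in the linked PR targets the point where this decoded value is coerced to an Arrow timestamp unit, rather than the INT96 decoding itself.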
--
This message was sent by Atlassian Jira
(v8.3.4#803005)