[https://issues.apache.org/jira/browse/ARROW-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339339#comment-17339339]
Karik Isichei commented on ARROW-12096:
---------------------------------------
Hi all,
Have had some time to look at this and get my head around the C++ codebase.
What makes sense to me, following [~apitrou]'s advice (exposing the
{{ArrowReaderProperties}} options and taking them into account in
{{GetArrowType}} and {{TransferInt96}}), would be to do the following:
1. Add an option like {{int96_timestamp_type_as}} = <any Arrow timestamp unit;
defaults to NS for backwards compatibility>.
2. Change {{Int96GetNanoSeconds}} (in {{cpp/src/parquet/types.h}}) to something
like {{Int96GetSeconds}}, giving it an additional parameter for the interval
size (again defaulting to NS). If the interval is defined as anything other
than NS, users may then get truncated timestamps / data loss when converting
the nanosecond component of the INT96 timestamp to the specified interval (see
the sketch below).
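To make the truncation behaviour concrete, here is a rough Python sketch of
the proposed conversion; the function name, constants, and unit parameter are
illustrative stand-ins for the C++ change, not the actual implementation:
{code:python}
import datetime

# INT96 stores nanoseconds-within-day (8 bytes) plus a Julian day number
# (4 bytes). The Julian day number of the Unix epoch (1970-01-01) is 2440588.
JULIAN_EPOCH_DAY = 2440588
NANOS_PER_DAY = 86400 * 10**9

# Divisors from nanoseconds down to the requested interval.
DIVISORS = {"ns": 1, "us": 10**3, "ms": 10**6, "s": 10**9}

def int96_to_timestamp(julian_day, nanos_of_day, unit="ns"):
    """Convert an INT96 value to an epoch-based timestamp in `unit`.
    Integer division truncates any sub-unit part of the nanosecond
    component, which is the data loss mentioned in point 2."""
    total_nanos = (julian_day - JULIAN_EPOCH_DAY) * NANOS_PER_DAY + nanos_of_day
    return total_nanos // DIVISORS[unit]

# 3000-01-01 as a Julian day number:
jd = (datetime.date(3000, 1, 1) - datetime.date(1970, 1, 1)).days + JULIAN_EPOCH_DAY

int96_to_timestamp(jd, 123_456_789, "us")  # the trailing 789 ns are truncated
int96_to_timestamp(jd, 123_456_789, "ns")  # ~3.25e19: exceeds int64, which is
                                           # exactly the overflow in this issue
{code}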
*Some Questions on the above:*
* Does this sound like an OK approach to take? I'm not sure whether renaming
functions in your codebase goes against any contribution conventions; as far
as I can tell that function isn't used in many places, so a rename wouldn't be
much effort. Otherwise, would the preferred approach be to create a separate
function for each INT96 -> timestamp interval?
* Do I need to raise any warnings if timestamps are being truncated in the C++
code, or is that something that would just be expressed in the docs? I.e. if
you specify your own timestamp unit for INT96, any truncation is at your own
risk.
> [Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when
> converting to Arrow Array (timestamp[ns])
> --------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-12096
> URL: https://issues.apache.org/jira/browse/ARROW-12096
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 2.0.0, 3.0.0
> Environment: macos mojave 10.14.6
> Python 3.8.3
> pyarrow 3.0.0
> pandas 1.2.3
> Reporter: Karik Isichei
> Priority: Major
>
> When reading Parquet data with timestamps stored as INT96, pyarrow assumes
> the timestamp type should be nanoseconds; converting to an Arrow table then
> overflows if the Parquet column stores values that are out of bounds for
> nanoseconds.
> {code:python}
> # Round Trip Example
> import datetime
> import pandas as pd
> import pyarrow as pa
> from pyarrow import parquet as pq
> df = pd.DataFrame({"a": [datetime.datetime(1000, 1, 1),
>                          datetime.datetime(2000, 1, 1),
>                          datetime.datetime(3000, 1, 1)]})
> a_df = pa.Table.from_pandas(df)
> a_df.schema # a: timestamp[us]
> pq.write_table(a_df, "test_round_trip.parquet",
>                use_deprecated_int96_timestamps=True, version="1.0")
> pfile = pq.ParquetFile("test_round_trip.parquet")
> pfile.schema_arrow # a: timestamp[ns]
> pq.read_table("test_round_trip.parquet").to_pandas()
> # Results in values:
> # 2169-02-08 23:09:07.419103232
> # 2000-01-01 00:00:00
> # 1830-11-23 00:50:52.580896768
> {code}
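> For reference, the wraparound happens because timestamp[ns] is an int64
> count of nanoseconds since the epoch, so only roughly the years 1677-2262
> are representable; a quick way to see the bounds (illustration only, not
> part of the fix):
> {code:python}
> import pandas as pd
> # The representable bounds of a 64-bit nanosecond timestamp:
> print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
> print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
> # 1000-01-01 and 3000-01-01 fall outside this range, so the int64
> # nanosecond count wraps around to the values shown above.
> {code}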
> The above example just demonstrates the bug by getting pyarrow to write out
> a Parquet file in a state similar to the original file where this bug was
> discovered. The bug was originally found when trying to read Parquet outputs
> from Amazon Athena with pyarrow (where we can't control the output format of
> the Parquet files)
> [Context|https://github.com/awslabs/aws-data-wrangler/issues/592].
> I found some existing issues that might also be related:
> * [ARROW-10444|https://issues.apache.org/jira/browse/ARROW-10444]
> * [ARROW-6779|https://issues.apache.org/jira/browse/ARROW-6779] (this shows
> a similar symptom, although testing it on pyarrow v3 raises an out-of-bounds
> error)