[
https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17109983#comment-17109983
]
Joris Van den Bossche commented on ARROW-8816:
----------------------------------------------
> the extra field gives "datetime" for pandas_type and "object" for numpy_type;
Ah, if there is pandas metadata present and it indicates object dtype, we could
indeed use that to avoid conversion to datetime64[ns] and keep datetime
objects instead. That sounds like it should be possible in principle.
> IMHO the important thing is to always be able to read back in what one wrote
> (possibly with wider types) if the write was successful, provided that one
> uses the same pyarrow version and the default options for both reading and
> writing.
Yes, but again: we need to distinguish writing Parquet from pandas<->pyarrow
conversion. When writing Parquet from a pyarrow table, a fully correct
roundtrip works perfectly fine. It is only the pandas<->pyarrow conversion that
causes problems.
And note that, in general, a roundtrip is always tricky when a single pyarrow
type can map to multiple pandas types (such as dates, which can be converted to
datetime64[D] or datetime.date, or ListArray, which can be converted to a
column of tuples, a column of lists, or a column of numpy arrays).
But I agree a roundtrip should be possible: a {{timestamp_as_object}} keyword
should at least help (while still requiring the user to specify a keyword). And
with the pandas metadata we could maybe try to automatically choose the right
default.
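To illustrate the idea of choosing a default from the pandas metadata: the comment above notes that an object column of datetimes is recorded with {{"datetime"}} for {{pandas_type}} and {{"object"}} for {{numpy_type}}. A reader could inspect that JSON and opt into object conversion only for such columns. This is a stdlib-only sketch; the helper name and the simplified metadata shape are illustrative, not pyarrow API:

```python
import json

def columns_needing_datetime_objects(pandas_metadata_json):
    """Hypothetical helper: given the JSON pandas metadata stored alongside
    an Arrow schema, list columns that were originally object columns
    holding datetime.datetime values, so a reader could keep them as
    objects instead of casting to datetime64[ns]."""
    meta = json.loads(pandas_metadata_json)
    return [
        col['name']
        for col in meta.get('columns', [])
        if col.get('pandas_type') == 'datetime'
        and col.get('numpy_type') == 'object'
    ]

# Simplified metadata fragment shaped like the fields quoted above.
metadata = json.dumps({
    'columns': [
        {'name': 'x', 'pandas_type': 'datetime', 'numpy_type': 'object'},
        {'name': 'y', 'pandas_type': 'datetime',
         'numpy_type': 'datetime64[ns]'},
    ]
})

print(columns_needing_datetime_objects(metadata))  # ['x']
```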
> [Python] Year 2263 or later datetimes get mangled when written using pandas
> ---------------------------------------------------------------------------
>
> Key: ARROW-8816
> URL: https://issues.apache.org/jira/browse/ARROW-8816
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.16.0, 0.17.0
> Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3,
> python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3,
> python 3.8.2, ubuntu 20.04 (linux).
> Reporter: Rauli Ruohonen
> Priority: Major
>
> Using pyarrow 0.17.0, this
>
> {code:java}
> import datetime
> import pandas as pd
>
> def try_with_year(year):
>     print(f'Year {year:_}:')
>     df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]})
>     df.to_parquet('foo.parquet', engine='pyarrow', compression=None)
>     try:
>         print(pd.read_parquet('foo.parquet', engine='pyarrow'))
>     except Exception as exc:
>         print(repr(exc))
>     print()
>
> try_with_year(2_263)
> try_with_year(2_262)
> {code}
>
> prints
>
> {noformat}
> Year 2_263:
> ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out
> of bounds timestamp: 9246182400000')
> Year 2_262:
> x
> 0 2262-01-01{noformat}
> and using pyarrow 0.16.0, it prints
>
>
> {noformat}
> Year 2_263:
> x
> 0 1678-06-12 00:25:26.290448384
> Year 2_262:
> x
> 0 2262-01-01{noformat}
> The issue is that 2263-01-01 is out of bounds for a timestamp stored using
> epoch nanoseconds, but not out of bounds for a Python datetime.
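The bound can be checked with the standard library alone: a signed 64-bit count of nanoseconds since the epoch tops out in April 2262, well before 2263-01-01, while Python datetimes reach the year 9999. A quick sketch (the nanosecond value matches the one in the error message, which is reported in milliseconds):

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1  # largest epoch-nanosecond value an int64 can hold

# Largest instant representable as datetime64[ns], truncated to the
# microsecond precision of datetime.datetime: 2262-04-11 23:47:16.854775
max_ns_datetime = EPOCH + datetime.timedelta(microseconds=INT64_MAX // 1000)
print(max_ns_datetime)

# 2263-01-01 expressed in epoch nanoseconds overflows int64 ...
delta = datetime.datetime(2263, 1, 1) - EPOCH
ns_2263 = (delta.days * 86_400 + delta.seconds) * 10**9
print(ns_2263, ns_2263 > INT64_MAX)

# ... but the same instant is a perfectly valid Python datetime.
print(datetime.datetime(2263, 1, 1))
```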
> While pyarrow 0.17.0 refuses to read the erroneous output, it is still
> possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or
> fastparquet), yielding the same result as with 0.16.0 above (i.e. only
> reading has changed in 0.17.0, not writing). It would be better if an error
> was raised when attempting to write the file instead of silently producing
> erroneous output.
> The reason I suspect this is a pyarrow issue instead of a pandas issue is
> this modified example:
>
> {code:java}
> import datetime
> import pandas as pd
> import pyarrow as pa
>
> df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
> table = pa.Table.from_pandas(df)
> print(table[0])
> try:
>     print(table.to_pandas())
> except Exception as exc:
>     print(repr(exc))
> {code}
> which prints
>
>
> {noformat}
> [
> [
> 2263-01-01 00:00:00.000000
> ]
> ]
> ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out
> of bounds timestamp: 9246182400000000'){noformat}
> on pyarrow 0.17.0 and
>
>
> {noformat}
> [
> [
> 2263-01-01 00:00:00.000000
> ]
> ]
> x
> 0 1678-06-12 00:25:26.290448384{noformat}
> on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods,
> and pyarrow prints the correct timestamp when asked to render it as a string
> (so it was not lost inside pandas), yet the
> pa.Table.from_pandas(df).to_pandas() round-trip fails.
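The 1678 value printed by 0.16.0 is consistent with the out-of-range nanosecond count silently wrapping around as a signed 64-bit integer. This stdlib-only sketch reproduces the observed value from that interpretation; it illustrates the arithmetic behind the symptom, not pyarrow's actual code path:

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)

# 2263-01-01 as epoch nanoseconds (the millisecond value from the error
# message, times 1000).
ns = 9_246_182_400_000_000_000

# Reinterpret the overflowing value as a signed 64-bit integer.
wrapped = (ns + 2**63) % 2**64 - 2**63
print(wrapped)  # negative: the timestamp wrapped into the past

# Convert back to a datetime, truncated to microsecond precision.
mangled = EPOCH + datetime.timedelta(microseconds=wrapped // 1000)
print(mangled)  # 1678-06-12 00:25:26.290448, matching the 0.16.0 output
```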
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)