[jira] [Commented] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas

Alenka Frim (Jira) Thu, 27 Oct 2022 06:24:09 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625122#comment-17625122
 ]


Alenka Frim commented on ARROW-8816:
------------------------------------

Closing this as it is no longer relevant: Arrow now raises 
{{ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in 
out of bounds timestamp}} when converting to pandas.
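
For reference, the bound behind that error can be checked with plain Python 
arithmetic. This is only a sketch: the helper names are made up for 
illustration, and the assumption that the old 0.16.0 output came from int64 
wraparound is mine, although the wrapped value does match the mangled 
timestamp quoted in the issue below.

```python
import datetime

INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)


def ns_to_datetime(ns):
    """Convert epoch nanoseconds to a datetime, truncated to microseconds."""
    return EPOCH + datetime.timedelta(microseconds=ns // 1000)


def wrap_int64(value):
    """Reinterpret an arbitrary integer as a two's-complement int64."""
    return (value + 2**63) % 2**64 - 2**63


# Largest instant that fits in int64 epoch nanoseconds (in April 2262).
max_ns_datetime = ns_to_datetime(INT64_MAX)

# 2263-01-01 in epoch microseconds, taken from the error message in the issue.
us_2263 = 9246182400000000
ns_2263 = us_2263 * 1000

# The nanosecond value does not fit in int64, hence the ArrowInvalid error.
fits = ns_2263 <= INT64_MAX

# Assumption: silently wrapping around instead gives the 1678 timestamp.
wrapped = wrap_int64(ns_2263)

print(max_ns_datetime)
print(fits)
print(ns_to_datetime(wrapped))
```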

I created a new issue, https://issues.apache.org/jira/browse/ARROW-18175, to 
track the work on using the information stored in the metadata:
{quote}Ah, if there is pandas metadata present and it indicates object dtype, 
we could indeed use that to avoid conversion to datetime64[ns], but keep 
datetime objects. That sounds as if it should be possible in principle.
{quote}
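
A rough sketch of the check described in that quote, in plain Python. The 
function name and the example metadata are hypothetical and hand-written for 
illustration; only the "columns", "name" and "numpy_type" field names are 
taken from the pandas metadata layout. This is not how ARROW-18175 is or will 
be implemented.

```python
import json


def keeps_datetime_objects(pandas_metadata_json, column):
    """Hypothetical helper: return True if the stored pandas metadata says
    the column had object dtype, so datetime objects should be kept."""
    metadata = json.loads(pandas_metadata_json)
    for field in metadata.get("columns", []):
        if field.get("name") == column:
            return field.get("numpy_type") == "object"
    # No metadata for this column: fall back to the default conversion.
    return False


# Hand-written example metadata (an assumption, not real pyarrow output).
example_metadata = json.dumps({
    "columns": [
        {"name": "x", "pandas_type": "datetime", "numpy_type": "object"},
        {"name": "y", "pandas_type": "datetime",
         "numpy_type": "datetime64[ns]"},
    ]
})

print(keeps_datetime_objects(example_metadata, "x"))
print(keeps_datetime_objects(example_metadata, "y"))
```

With a check like this, a column written from out-of-range datetime objects 
could be returned as Python datetime objects instead of being cast to 
datetime64[ns].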

> [Python] Year 2263 or later datetimes get mangled when written using pandas
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8816
>                 URL: https://issues.apache.org/jira/browse/ARROW-8816
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0, 0.17.0
>         Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, 
> python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, 
> python 3.8.2, ubuntu 20.04 (linux).
>            Reporter: Rauli Ruohonen
>            Priority: Major
>
> Using pyarrow 0.17.0, this
>  
> {code:python}
> import datetime
> import pandas as pd
> def try_with_year(year):
>     print(f'Year {year:_}:')
>     df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]})
>     df.to_parquet('foo.parquet', engine='pyarrow', compression=None)
>     try:
>         print(pd.read_parquet('foo.parquet', engine='pyarrow'))
>     except Exception as exc:
>         print(repr(exc))
>     print()
> try_with_year(2_263)
> try_with_year(2_262)
> {code}
>  
> prints
>  
> {noformat}
> Year 2_263:
> ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out 
> of bounds timestamp: 9246182400000')
> Year 2_262:
>            x
> 0 2262-01-01{noformat}
> and using pyarrow 0.16.0, it prints
>  
>  
> {noformat}
> Year 2_263:
>                               x
> 0 1678-06-12 00:25:26.290448384
> Year 2_262:
>            x
> 0 2262-01-01{noformat}
> The issue is that 2263-01-01 is out of bounds for a timestamp stored using 
> epoch nanoseconds, but not out of bounds for a Python datetime.
> While pyarrow 0.17.0 refuses to read the erroneous output, it is still 
> possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or 
> fastparquet), yielding the same result as with 0.16.0 above (i.e. only 
> reading has changed in 0.17.0, not writing). It would be better if an error 
> was raised when attempting to write the file instead of silently producing 
> erroneous output.
> The reason I suspect this is a pyarrow issue instead of a pandas issue is 
> this modified example:
>  
> {code:python}
> import datetime
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
> table = pa.Table.from_pandas(df)
> print(table[0])
> try:
>     print(table.to_pandas())
> except Exception as exc:
>     print(repr(exc))
> {code}
> which prints
>  
>  
> {noformat}
> [
>   [
>     2263-01-01 00:00:00.000000
>   ]
> ]
> ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out 
> of bounds timestamp: 9246182400000000'){noformat}
> on pyarrow 0.17.0 and
>  
>  
> {noformat}
> [
>   [
>     2263-01-01 00:00:00.000000
>   ]
> ]
>                               x
> 0 1678-06-12 00:25:26.290448384{noformat}
> on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, 
> and pyarrow prints the correct timestamp when asked to produce it as a string 
> (so it was not lost inside pandas), but the 
> pa.Table.from_pandas(df).to_pandas() round-trip fails.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
