[
https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17109983#comment-17109983
]
Joris Van den Bossche commented on ARROW-8816:
----------------------------------------------
> the extra field gives "datetime" for pandas_type and "object" for numpy_type;
Ah, if there is pandas metadata present and it indicates object dtype, we could
indeed use that to avoid conversion to datetime64[ns] and keep datetime
objects instead. That sounds like it should be possible in principle.
> IMHO the important thing is to always be able to read back in what one wrote
> (possibly with wider types) if the write was successful, provided that one
> uses the same pyarrow version and the default options for both reading and
> writing.
Yes, but again: we need to distinguish writing Parquet from pandas<->pyarrow
conversion. When writing Parquet from a pyarrow table, a fully correct
roundtrip works perfectly fine. It is only the pandas<->pyarrow conversion that
causes problems.
And note that, in general, a roundtrip is always tricky when a single pyarrow
type can map to multiple pandas types (such as dates, which can be converted to
datetime64[D] or datetime.date, or ListArray, which can be converted to a
column of tuples, a column of lists, or a column of numpy arrays).
But I agree a roundtrip should be possible: a {{timestamp_as_object}} keyword
should at least help (while still requiring the user to specify a keyword). And
with the pandas metadata we could maybe try to automatically choose the right
default.
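To illustrate the idea of choosing a default from the pandas metadata: the comment above notes that an object column of datetimes is recorded with {{"datetime"}} for {{pandas_type}} and {{"object"}} for {{numpy_type}}. A reader could inspect that JSON and opt into object conversion only for such columns. This is a stdlib-only sketch; the helper name and the simplified metadata shape are illustrative, not pyarrow API:

```python
import json

def columns_needing_datetime_objects(pandas_metadata_json):
    """Hypothetical helper: given the JSON pandas metadata stored alongside
    an Arrow schema, list columns that were originally object columns
    holding datetime.datetime values, so a reader could keep them as
    objects instead of casting to datetime64[ns]."""
    meta = json.loads(pandas_metadata_json)
    return [
        col['name']
        for col in meta.get('columns', [])
        if col.get('pandas_type') == 'datetime'
        and col.get('numpy_type') == 'object'
    ]

# Simplified metadata fragment shaped like the fields quoted above.
metadata = json.dumps({
    'columns': [
        {'name': 'x', 'pandas_type': 'datetime', 'numpy_type': 'object'},
        {'name': 'y', 'pandas_type': 'datetime',
         'numpy_type': 'datetime64[ns]'},
    ]
})

print(columns_needing_datetime_objects(metadata))  # ['x']
```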
> [Python] Year 2263 or later datetimes get mangled when written using pandas
> ---------------------------------------------------------------------------
>
> Key: ARROW-8816
> URL: https://issues.apache.org/jira/browse/ARROW-8816
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.16.0, 0.17.0
> Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3,
> python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3,
> python 3.8.2, ubuntu 20.04 (linux).
> Reporter: Rauli Ruohonen
> Priority: Major
>
> Using pyarrow 0.17.0, this
>
> {code:java}
> import datetime
> import pandas as pd
>
> def try_with_year(year):
>     print(f'Year {year:_}:')
>     df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]})
>     df.to_parquet('foo.parquet', engine='pyarrow', compression=None)
>     try:
>         print(pd.read_parquet('foo.parquet', engine='pyarrow'))
>     except Exception as exc:
>         print(repr(exc))
>     print()
>
> try_with_year(2_263)
> try_with_year(2_262)
> {code}
>
> prints
>
> {noformat}
> Year 2_263:
> ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out
> of bounds timestamp: 9246182400000')
> Year 2_262:
> x
> 0 2262-01-01{noformat}
> and using pyarrow 0.16.0, it prints
>
>
> {noformat}
> Year 2_263:
> x
> 0 1678-06-12 00:25:26.290448384
> Year 2_262:
> x
> 0 2262-01-01{noformat}
> The issue is that 2263-01-01 is out of bounds for a timestamp stored using
> epoch nanoseconds, but not out of bounds for a Python datetime.
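The bound can be checked with the standard library alone: a signed 64-bit count of nanoseconds since the epoch tops out in April 2262, well before 2263-01-01, while Python datetimes reach the year 9999. A quick sketch (the nanosecond value matches the one in the error message, which is reported in milliseconds):

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1  # largest epoch-nanosecond value an int64 can hold

# Largest instant representable as datetime64[ns], truncated to the
# microsecond precision of datetime.datetime: 2262-04-11 23:47:16.854775
max_ns_datetime = EPOCH + datetime.timedelta(microseconds=INT64_MAX // 1000)
print(max_ns_datetime)

# 2263-01-01 expressed in epoch nanoseconds overflows int64 ...
delta = datetime.datetime(2263, 1, 1) - EPOCH
ns_2263 = (delta.days * 86_400 + delta.seconds) * 10**9
print(ns_2263, ns_2263 > INT64_MAX)

# ... but the same instant is a perfectly valid Python datetime.
print(datetime.datetime(2263, 1, 1))
```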
> While pyarrow 0.17.0 refuses to read the erroneous output, it is still
> possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or
> fastparquet), yielding the same result as with 0.16.0 above (i.e. only
> reading has changed in 0.17.0, not writing). It would be better if an error
> was raised when attempting to write the file instead of silently producing
> erroneous output.
> The reason I suspect this is a pyarrow issue instead of a pandas issue is
> this modified example:
>
> {code:java}
> import datetime
> import pandas as pd
> import pyarrow as pa
>
> df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
> table = pa.Table.from_pandas(df)
> print(table[0])
> try:
>     print(table.to_pandas())
> except Exception as exc:
>     print(repr(exc))
> {code}
> which prints
>
>
> {noformat}
> [
> [
> 2263-01-01 00:00:00.000000
> ]
> ]
> ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out
> of bounds timestamp: 9246182400000000'){noformat}
> on pyarrow 0.17.0 and
>
>
> {noformat}
> [
> [
> 2263-01-01 00:00:00.000000
> ]
> ]
> x
> 0 1678-06-12 00:25:26.290448384{noformat}
> on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods,
> and pyarrow prints the correct timestamp when asked to render it as a string
> (so it was not lost inside pandas), yet the
> pa.Table.from_pandas(df).to_pandas() round-trip fails.
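The 1678 value printed by 0.16.0 is consistent with the out-of-range nanosecond count silently wrapping around as a signed 64-bit integer. This stdlib-only sketch reproduces the observed value from that interpretation; it illustrates the arithmetic behind the symptom, not pyarrow's actual code path:

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)

# 2263-01-01 as epoch nanoseconds (the millisecond value from the error
# message, times 1000).
ns = 9_246_182_400_000_000_000

# Reinterpret the overflowing value as a signed 64-bit integer.
wrapped = (ns + 2**63) % 2**64 - 2**63
print(wrapped)  # negative: the timestamp wrapped into the past

# Convert back to a datetime, truncated to microsecond precision.
mangled = EPOCH + datetime.timedelta(microseconds=wrapped // 1000)
print(mangled)  # 1678-06-12 00:25:26.290448, matching the 0.16.0 output
```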
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)