[ https://issues.apache.org/jira/browse/ARROW-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116951#comment-17116951 ]

Joris Van den Bossche commented on ARROW-8944:
----------------------------------------------

The problem here is that pyarrow converts the NaT values to 
"0001-01-01". Using a simplified example:

{code}
In [22]: pa.array(pd.Series([pd.NaT, pd.Timestamp("2012-01-01")], dtype=object))
Out[22]: 
<pyarrow.lib.TimestampArray object at 0x7f954b176fa8>
[
 0001-01-01 00:00:00.000000,
 2012-01-01 00:00:00.000000
]
{code}

This in itself is a bug, which is covered by ARROW-842 (and ARROW-8115). 
When converting back to pandas, pyarrow tries to cast this to nanosecond 
resolution, because that is the only resolution pandas supports. However, 
"0001-01-01" doesn't fit into the nanosecond range, and therefore you get this 
error. There is work underway to make this conversion back to pandas more 
flexible, so you can opt for datetime objects instead (see e.g. ARROW-5359).
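For context, the nanosecond limitation can be seen directly in pandas, independent of Arrow (a small illustration; the exception branch assumes the default nanosecond-based `to_datetime` behaviour):

```python
import pandas as pd

# pandas stores timestamps as int64 nanoseconds since the Unix epoch,
# so only roughly the years 1677-2262 are representable.
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# "0001-01-01" falls far outside that window, which is why the cast
# from timestamp[us] to timestamp[ns] fails.
try:
    pd.to_datetime("0001-01-01")
except pd.errors.OutOfBoundsDatetime as exc:
    print("out of bounds:", exc)
```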

The above explains the behaviour you see. But in your case, the actual problem 
is that you have object-dtype data:

{code:python}
In [23]: dataframe.dtypes
Out[23]: 
id            int64
timestamp    object
dtype: object

In [24]: dataframe['timestamp'].values
Out[24]: 
array([NaT, NaT, NaT, NaT,
       Timestamp('2020-03-02 03:03:17.791062+0000', tz='UTC'), NaT, NaT],
      dtype=object)
{code}

That is why you run into this NaT conversion bug. 
The reason you have object-dtype data is that you appended a dataframe 
with tz-aware data to one with tz-naive data:

{code:python}
# dataframe before appending dataframe2
In [27]: dataframe.dtypes
Out[27]: 
id                    int64
timestamp    datetime64[ns]
dtype: object

In [28]: dataframe2.dtypes
Out[28]: 
id                         int64
timestamp    datetime64[ns, UTC]
dtype: object

In [29]: dataframe.append(dataframe2).dtypes
Out[29]: 
id            int64
timestamp    object
dtype: object
{code}
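One way to avoid the object dtype in the first place is to make both columns tz-aware (or both tz-naive) before combining them. A minimal sketch with made-up data mirroring the report; it uses `pd.concat`, since `DataFrame.append` is deprecated in recent pandas:

```python
import pandas as pd

# A frame with only missing timestamps -> tz-naive datetime64[ns]
dataframe = pd.DataFrame({
    "id": [1, 2, 3],
    "timestamp": pd.to_datetime(["", "", ""], errors="coerce"),
})

# A frame with one UTC timestamp -> datetime64[ns, UTC]
dataframe2 = pd.DataFrame({
    "id": [4, 5],
    "timestamp": pd.to_datetime(
        ["", "2020-03-02T03:03:17.791062Z"], utc=True, errors="coerce"
    ),
})

# Localize the tz-naive column to UTC before combining, so the result
# keeps a proper datetime64 dtype instead of falling back to object.
dataframe["timestamp"] = dataframe["timestamp"].dt.tz_localize("UTC")

combined = pd.concat([dataframe, dataframe2], ignore_index=True)
print(combined.dtypes)
```

With matching timezone dtypes the combined column stays datetime64[ns, UTC], so the Parquet roundtrip preserves the NaT values instead of hitting the conversion bug above.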

> [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8944
>                 URL: https://issues.apache.org/jira/browse/ARROW-8944
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.17.0, 0.17.1
>         Environment: pandas==1.0.3
> pyarrow==0.17.1
> Python==3.7.6 @ Windows 10 64Bit
>            Reporter: Daniel Figus
>            Priority: Major
>
> The following pandas -> parquet -> pandas roundtrip raises an out of bounds 
> timestamp error with pyarrow 0.17.0 and 0.17.1:
> {code:python}
> import pandas
> target = 'ts_roundtrip.parquet'
> dataframe = pandas.DataFrame({'id':[1,2,3],'timestamp':['', '', '']})
> dataframe['timestamp'] = pandas.to_datetime(dataframe['timestamp'], errors='raise')
> dataframe2 = pandas.DataFrame({'id': [4, 5, 6, 7],
>                                'timestamp': ['', '2020-03-02T03:03:17.791062Z', '', '']})
> dataframe2['timestamp'] = pandas.to_datetime(dataframe2['timestamp'], errors='raise')
> dataframe = dataframe.append(dataframe2)
> print(dataframe.head(10))
> dataframe.to_parquet(target, coerce_timestamps=None, index=False, version='2.0')
> dataframe_new = pandas.read_parquet(target)
> print(dataframe_new.head())
> {code}
> Output:
> {noformat}
>    id                         timestamp
> 0   1                               NaT
> 1   2                               NaT
> 2   3                               NaT
> 0   4                               NaT
> 1   5  2020-03-02 03:03:17.791062+00:00
> 2   6                               NaT
> 3   7                               NaT
> Traceback (most recent call last):
>   File "c:\some\path\pyarrow_ts_test.py", line 16, in <module>
>     dataframe_new = pandas.read_parquet(target)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 310, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 125, in read
>     path, columns=columns, **kwargs
>   File "pyarrow\array.pxi", line 587, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table._to_pandas
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 766, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 1102, in _table_to_blocks
>     list(extension_columns.keys()))
>   File "pyarrow\table.pxi", line 1107, in pyarrow.lib.table_to_blocks
>   File "pyarrow\error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
> {noformat}
> Background: 
>  We have a dataset with a timestamp column that is sparsely populated and 
> originates from many JSON files, so it is very likely that some of those 
> JSON files contain no timestamp (as a string in ISO format) but just an 
> empty string. Each JSON file was read into a pandas dataframe, the 
> timestamp column was cast to datetime, and all dataframes were appended. 
> That was done with pyarrow<0.17.0, and those Parquet files can no longer 
> be read; they result in the above mentioned error message as well.
> A closer look at our old Parquet files shows that the NaTs are converted to 
> "1754-08-30 22:43:41.128654848" when reading back into a pandas dataframe :(. 
> You get the same result when you run the above code with pyarrow==0.16.0. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
