[jira] [Commented] (ARROW-8944) [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp

Daniel Figus (Jira) Thu, 10 Sep 2020 03:05:24 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193511#comment-17193511
 ]


Daniel Figus commented on ARROW-8944:
-------------------------------------

[~jorisvandenbossche] I think this can be closed as it was resolved with 
ARROW-842. Just double checked it and my example from above works.

> [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8944
>                 URL: https://issues.apache.org/jira/browse/ARROW-8944
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.17.0, 0.17.1
>         Environment: pandas==1.0.3
> pyarrow==0.17.1
> Python==3.7.6 @ Windows 10 64Bit
>            Reporter: Daniel Figus
>            Priority: Major
>
> The following pandas -> parquet -> pandas roundtrip raises an out of bounds 
> timestamp error with pyarrow 0.17.0 and 0.17.1:
> {code:python}
> import pandas
> target = 'ts_roundtrip.parquet'
> dataframe = pandas.DataFrame({'id':[1,2,3],'timestamp':['', '', '']})
> dataframe['timestamp'] = pandas.to_datetime(dataframe['timestamp'],errors='raise')
> dataframe2 = pandas.DataFrame({'id':[4,5,6,7],'timestamp':['', '2020-03-02T03:03:17.791062Z','','']})
> dataframe2['timestamp'] = pandas.to_datetime(dataframe2['timestamp'],errors='raise')
> dataframe = dataframe.append(dataframe2)
> print(dataframe.head(10))
> dataframe.to_parquet(target, coerce_timestamps=None, index=False, version='2.0')
> dataframe_new = pandas.read_parquet(target)
> print(dataframe_new.head())
> {code}
> Output:
> {noformat}
>    id                         timestamp
> 0   1                               NaT
> 1   2                               NaT
> 2   3                               NaT
> 0   4                               NaT
> 1   5  2020-03-02 03:03:17.791062+00:00
> 2   6                               NaT
> 3   7                               NaT
> Traceback (most recent call last):
>   File "c:\some\path\pyarrow_ts_test.py", line 16, in <module>
>     dataframe_new = pandas.read_parquet(target)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 310, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 125, in read
>     path, columns=columns, **kwargs
>   File "pyarrow\array.pxi", line 587, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table._to_pandas
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 766, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 1102, in _table_to_blocks
>     list(extension_columns.keys()))
>   File "pyarrow\table.pxi", line 1107, in pyarrow.lib.table_to_blocks
>   File "pyarrow\error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
> {noformat}
> Background: 
>  We have a dataset with a timestamp column that is sparsely populated and 
> originates from many JSON files. So it is very likely that some of those 
> JSON files contain no timestamp (as a string in ISO format) and instead just 
> an empty string. Each JSON file was read into a pandas dataframe, the 
> timestamp column cast to datetime, and all dataframes appended. That was 
> done with pyarrow<0.17.0, and those parquet files can no longer be read; 
> they result in the above-mentioned error message as well.
> A closer look at our old parquet files shows that the NaTs are converted to 
> "1754-08-30 22:43:41.128654848" when reading back to a pandas dataframe :(. 
> You get the same result when you run the above code with pyarrow==0.16.0. 
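
The two odd values in this report are consistent with plain integer arithmetic, which the sketch below checks using only the Python standard library. It assumes (nothing in this thread confirms it) that the older reader scaled microseconds to nanoseconds in int64 without an overflow check. `wrap_int64` is a hypothetical helper written for illustration; it is not part of pyarrow or pandas.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Microsecond value quoted in the ArrowInvalid message above.
raw_us = -62135596800000000

# Interpreted as microseconds since the Unix epoch, this is 0001-01-01 00:00:00.
as_datetime = EPOCH + timedelta(microseconds=raw_us)
print(as_datetime)  # 0001-01-01 00:00:00


def wrap_int64(value):
    """Hypothetical helper: emulate two's-complement int64 wraparound."""
    return (value + 2**63) % 2**64 - 2**63


# Scaling to nanoseconds leaves the int64 range, so an unchecked multiply
# would wrap around (assumption about the pre-0.17 behaviour).
raw_ns = raw_us * 1000
fits_in_int64 = -(2**63) <= raw_ns < 2**63
wrapped_ns = wrap_int64(raw_ns)
print(fits_in_int64, wrapped_ns)  # False -6795364578871345152

# Render the wrapped value as a nanosecond-precision timestamp.
seconds, nanos = divmod(wrapped_ns, 10**9)
wrapped_text = "{}.{:09d}".format(
    (EPOCH + timedelta(seconds=seconds)).isoformat(sep=" "), nanos
)
print(wrapped_text)  # 1754-08-30 22:43:41.128654848
```

Under that assumption, the 1754 timestamp seen with pyarrow 0.16.0 and the out of bounds error raised by 0.17.x would both stem from the same stored value: the NaT entries written as year 1 instead of as nulls.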



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
