[
https://issues.apache.org/jira/browse/ARROW-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193511#comment-17193511
]
Daniel Figus commented on ARROW-8944:
-------------------------------------
[~jorisvandenbossche] I think this can be closed as it was resolved with
ARROW-842. Just double checked it and my example from above works.
> [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp
> ---------------------------------------------------------------------------
>
> Key: ARROW-8944
> URL: https://issues.apache.org/jira/browse/ARROW-8944
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.17.0, 0.17.1
> Environment: pandas==1.0.3
> pyarrow==0.17.1
> Python==3.7.6 @ Windows 10 64Bit
> Reporter: Daniel Figus
> Priority: Major
>
> The following pandas -> parquet -> pandas roundtrip raises an out of bounds
> timestamp error with pyarrow 0.17.0 and 0.17.1:
> {code:python}
> import pandas
> target = 'ts_roundtrip.parquet'
> dataframe = pandas.DataFrame({'id':[1,2,3],'timestamp':['', '', '']})
> dataframe['timestamp'] = pandas.to_datetime(dataframe['timestamp'],errors='raise')
> dataframe2 = pandas.DataFrame({'id':[4,5,6,7],'timestamp':['',
> '2020-03-02T03:03:17.791062Z','','']})
> dataframe2['timestamp'] = pandas.to_datetime(dataframe2['timestamp'],errors='raise')
> dataframe = dataframe.append(dataframe2)
> print(dataframe.head(10))
> dataframe.to_parquet(target, coerce_timestamps=None, index=False,
> version='2.0')
> dataframe_new = pandas.read_parquet(target)
> print(dataframe_new.head())
> {code}
> Output:
> {noformat}
>    id                         timestamp
> 0   1                               NaT
> 1   2                               NaT
> 2   3                               NaT
> 0   4                               NaT
> 1   5  2020-03-02 03:03:17.791062+00:00
> 2   6                               NaT
> 3   7                               NaT
> Traceback (most recent call last):
>   File "c:\some\path\pyarrow_ts_test.py", line 16, in <module>
>     dataframe_new = pandas.read_parquet(target)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 310, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 125, in read
>     path, columns=columns, **kwargs
>   File "pyarrow\array.pxi", line 587, in pyarrow.lib._PandasConvertible.to_pandas
>   File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table._to_pandas
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 766, in table_to_blockmanager
>     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
>   File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 1102, in _table_to_blocks
>     list(extension_columns.keys()))
>   File "pyarrow\table.pxi", line 1107, in pyarrow.lib.table_to_blocks
>   File "pyarrow\error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
> {noformat}
> Background:
> We have a dataset with a timestamp column that is sparsely populated and
> originates from many json files. So it is very likely that in some of those
> json files there is no timestamp (as string in ISO format) and instead just
> an empty string. Each JSON file was read into a pandas dataframe, the
> timestamp column was cast to datetime, and all dataframes were appended. That
> was done with pyarrow<0.17.0; those parquet files can no longer be read and
> produce the error message above as well.
> A closer look at our old parquet files shows that the NaTs are converted to
> "1754-08-30 22:43:41.128654848" when reading back to a pandas dataframe :(.
> You get the same result when you run the above code with pyarrow==0.16.0.
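
The two values quoted in the report can be related with plain arithmetic. The sketch below uses only the Python standard library and does not call pyarrow. It assumes that the older pyarrow versions multiplied the stored microsecond value by 1000 without an overflow check, so that the result wrapped around in a signed 64-bit integer. That is an interpretation of the numbers, not something the report states, and the helper name wrap_int64 is made up for illustration.

```python
from datetime import datetime, timedelta

# Microsecond value taken verbatim from the ArrowInvalid message above.
US_VALUE = -62135596800000000

EPOCH = datetime(1970, 1, 1)

# Interpreted as microseconds since the Unix epoch, this is 0001-01-01 00:00:00.
stored = EPOCH + timedelta(microseconds=US_VALUE)


def wrap_int64(value):
    """Hypothetical helper: reduce an integer to the signed 64-bit range."""
    return (value + 2**63) % 2**64 - 2**63


# Assumption: an unchecked us -> ns cast multiplies by 1000 and wraps around.
ns_wrapped = wrap_int64(US_VALUE * 1000)

# Split into whole seconds and leftover nanoseconds (datetime has no ns field).
seconds, nanos = divmod(ns_wrapped, 10**9)
wrapped = EPOCH + timedelta(seconds=seconds)

print(stored)           # the timestamp that was actually written
print(wrapped, nanos)   # the timestamp an unchecked cast would yield
```

Under that assumption the wrapped value comes out as 1754-08-30 22:43:41 with 128654848 leftover nanoseconds, which matches the timestamp seen with pyarrow==0.16.0, while pyarrow 0.17.x reports the same cast as an out of bounds error instead.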