[
https://issues.apache.org/jira/browse/ARROW-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116951#comment-17116951
]
Joris Van den Bossche commented on ARROW-8944:
----------------------------------------------
The problem here is that the NaT values are converted to pyarrow as
"0001-01-01". Using a simplified example:
{code}
In [22]: pa.array(pd.Series([pd.NaT, pd.Timestamp("2012-01-01")],
dtype=object))
Out[22]:
<pyarrow.lib.TimestampArray object at 0x7f954b176fa8>
[
0001-01-01 00:00:00.000000,
2012-01-01 00:00:00.000000
]
{code}
This in itself is a bug, which is covered by ARROW-842 (and ARROW-8115).
When converting back to pandas, it tries to convert this to nanosecond
resolution, because this is the only resolution that pandas supports. However,
"0001-01-01" doesn't fit into the nanosecond range, and therefore you get this
error. There is work underway to make this conversion back to pandas more
flexible, so you can opt for datetime objects (see e.g. ARROW-5359).
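To make the range argument concrete, here is a small standard-library sketch of the arithmetic (it does not call pyarrow or pandas). It shows that the microsecond value in the ArrowInvalid message is exactly "0001-01-01", and that scaling it to nanoseconds no longer fits in a signed 64-bit integer. The last step is an assumption, not something confirmed in this thread: if an older release performed the same cast without a bounds check, two's-complement wrap-around would produce the "1754-08-30 22:43:41.128654848" value mentioned in the issue description.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

# Microseconds since the Unix epoch for the bogus "0001-01-01" value.
us = (datetime(1, 1, 1) - EPOCH) // timedelta(microseconds=1)
print(us)  # the number reported in the ArrowInvalid message

# timestamp[us] -> timestamp[ns] multiplies by 1000, which leaves the int64 range.
ns = us * 1000
fits = INT64_MIN <= ns <= INT64_MAX
print(fits)

# Assumption: an unchecked cast would wrap around like a two's-complement int64.
wrapped = (ns - INT64_MIN) % 2**64 + INT64_MIN
seconds, nanos = divmod(wrapped, 10**9)
wrapped_dt = EPOCH + timedelta(seconds=seconds)
print(wrapped_dt, nanos)
```

The wrapped value lands in the year 1754, inside the nanosecond range, which is why it would come back as a valid but meaningless timestamp rather than an error.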
The above explains the behaviour you see. But in your case, the actual problem
is that you have object dtype data:
{code:python}
In [23]: dataframe.dtypes
Out[23]:
id int64
timestamp object
dtype: object
In [24]: dataframe['timestamp'].values
Out[24]:
array([NaT, NaT, NaT, NaT,
Timestamp('2020-03-02 03:03:17.791062+0000', tz='UTC'), NaT, NaT],
dtype=object)
{code}
That is why you run into this NaT conversion bug.
Now, the reason you have object dtype data is that you appended a dataframe
with tz-aware data to one with tz-naive data:
{code:python}
# dataframe before appending dataframe2
In [27]: dataframe.dtypes
Out[27]:
id int64
timestamp datetime64[ns]
dtype: object
In [28]: dataframe2.dtypes
Out[28]:
id int64
timestamp datetime64[ns, UTC]
dtype: object
In [29]: dataframe.append(dataframe2).dtypes
Out[29]:
id int64
timestamp object
dtype: object
{code}
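One possible way to avoid the object dtype, sketched here as a suggestion rather than a confirmed fix for this issue, is to parse every frame as tz-aware before combining them, so that both columns share a timezone. The sketch assumes that passing utc=True to pandas.to_datetime is acceptable for your data (naive values would be interpreted as UTC), and it uses pandas.concat rather than DataFrame.append. The names df1, df2 and combined are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "timestamp": ["", "", ""]})
df2 = pd.DataFrame(
    {"id": [4, 5, 6, 7],
     "timestamp": ["", "2020-03-02T03:03:17.791062Z", "", ""]}
)

# Parse both columns as tz-aware (UTC), even when a frame holds only
# empty strings; those become NaT but keep the tz-aware dtype.
for df in (df1, df2):
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

# Both columns are now tz-aware with the same timezone, so combining
# them should keep a datetime dtype instead of falling back to object.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined.dtypes)
```

With a proper datetime column, the NaT values are stored as missing values instead of going through the object-to-timestamp conversion described above.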
> [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp
> ---------------------------------------------------------------------------
>
> Key: ARROW-8944
> URL: https://issues.apache.org/jira/browse/ARROW-8944
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.17.0, 0.17.1
> Environment: pandas==1.0.3
> pyarrow==0.17.1
> Python==3.7.6 @ Windows 10 64Bit
> Reporter: Daniel Figus
> Priority: Major
>
> The following pandas -> parquet -> pandas roundtrip raises an out of bounds
> timestamp error with pyarrow 0.17.0 and 0.17.1:
> {code:python}
> import pandas
> target = 'ts_roundtrip.parquet'
> dataframe = pandas.DataFrame({'id':[1,2,3],'timestamp':['', '', '']})
> dataframe['timestamp'] = pandas.to_datetime(dataframe['timestamp'],errors='raise')
> dataframe2 = pandas.DataFrame({'id':[4,5,6,7],'timestamp':['',
> '2020-03-02T03:03:17.791062Z','','']})
> dataframe2['timestamp'] = pandas.to_datetime(dataframe2['timestamp'],errors='raise')
> dataframe = dataframe.append(dataframe2)
> print(dataframe.head(10))
> dataframe.to_parquet(target, coerce_timestamps=None, index=False,
> version='2.0')
> dataframe_new = pandas.read_parquet(target)
> print(dataframe_new.head())
> {code}
> Output:
> {noformat}
> id timestamp
> 0 1 NaT
> 1 2 NaT
> 2 3 NaT
> 0 4 NaT
> 1 5 2020-03-02 03:03:17.791062+00:00
> 2 6 NaT
> 3 7 NaT
> Traceback (most recent call last):
> File "c:\some\path\pyarrow_ts_test.py", line 16, in <module>
> dataframe_new = pandas.read_parquet(target)
> File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 310,
> in read_parquet
> return impl.read(path, columns=columns, **kwargs)
> File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 125,
> in read
> path, columns=columns, **kwargs
> File "pyarrow\array.pxi", line 587, in
> pyarrow.lib._PandasConvertible.to_pandas
> File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table._to_pandas
> File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line
> 766, in table_to_blockmanager
> blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
> File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line
> 1102, in _table_to_blocks
> list(extension_columns.keys()))
> File "pyarrow\table.pxi", line 1107, in pyarrow.lib.table_to_blocks
> File "pyarrow\error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would
> result in out of bounds timestamp: -62135596800000000
> {noformat}
> Background:
> We have a dataset with a timestamp column that is sparsely populated and
> originates from many json files. So it is very likely that in some of those
> json files there is no timestamp (as string in ISO format) and instead just
> an empty string. Each JSON file was read into a pandas dataframe, the
> timestamp column cast to datetime and all dataframes appended. That was
> done with pyarrow<0.17.0 and those parquet files cannot be read any longer
> and result in the above mentioned error message as well.
> A closer look at our old parquets shows that the NaTs are converted to
> "1754-08-30 22:43:41.128654848" when reading back to a pandas dataframe :(.
> You get the same result when you run the above code and pyarrow==0.16.0.