Daniel Figus created ARROW-8944:
-----------------------------------

             Summary: [Python] Pandas - Parquet - Pandas roundtrip causes out of bounds timestamp
                 Key: ARROW-8944
                 URL: https://issues.apache.org/jira/browse/ARROW-8944
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.17.1, 0.17.0
         Environment: pandas==1.0.3
pyarrow==0.17.1
Python==3.7.6 @ Windows 10 64Bit
            Reporter: Daniel Figus


The following pandas -> parquet -> pandas roundtrip raises an out of bounds 
timestamp error with pyarrow 0.17.0 and 0.17.1:
{code:python}
import pandas

target = 'ts_roundtrip.parquet'

# First frame: only empty strings, so every timestamp parses to NaT
dataframe = pandas.DataFrame({'id': [1, 2, 3], 'timestamp': ['', '', '']})
dataframe['timestamp'] = pandas.to_datetime(dataframe['timestamp'], errors='raise')

# Second frame: one real UTC timestamp among empty strings
dataframe2 = pandas.DataFrame({
    'id': [4, 5, 6, 7],
    'timestamp': ['', '2020-03-02T03:03:17.791062Z', '', ''],
})
dataframe2['timestamp'] = pandas.to_datetime(dataframe2['timestamp'], errors='raise')
dataframe = dataframe.append(dataframe2)

print(dataframe.head(10))

dataframe.to_parquet(target, coerce_timestamps=None, index=False, version='2.0')

dataframe_new = pandas.read_parquet(target)
print(dataframe_new.head())
{code}
Output:
{noformat}
   id                         timestamp
0   1                               NaT
1   2                               NaT
2   3                               NaT
0   4                               NaT
1   5  2020-03-02 03:03:17.791062+00:00
2   6                               NaT
3   7                               NaT
Traceback (most recent call last):
  File "c:\some\path\pyarrow_ts_test.py", line 16, in <module>
    dataframe_new = pandas.read_parquet(target)
  File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 310, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "c:\some\path\venv\lib\site-packages\pandas\io\parquet.py", line 125, in read
    path, columns=columns, **kwargs
  File "pyarrow\array.pxi", line 587, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow\table.pxi", line 1640, in pyarrow.lib.Table._to_pandas
  File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 766, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "c:\some\path\venv\lib\site-packages\pyarrow\pandas_compat.py", line 1102, in _table_to_blocks
    list(extension_columns.keys()))
  File "pyarrow\table.pxi", line 1107, in pyarrow.lib.table_to_blocks
  File "pyarrow\error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
{noformat}
Background: 
 We have a dataset with a sparsely populated timestamp column that originates 
from many JSON files. Some of those files are very likely to contain an empty 
string instead of a timestamp (a string in ISO format). Each JSON file was read 
into a pandas dataframe, the timestamp column was cast to datetime, and all 
dataframes were appended. That was done with pyarrow<0.17.0; those parquet 
files can no longer be read either and fail with the error message above.

A closer look at our old parquet files shows that the NaTs are converted to 
"1754-08-30 22:43:41.128654848" when read back into a pandas dataframe :(. You 
get the same result when you run the above code with pyarrow==0.16.0. 
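
For reference, here is a minimal sketch (standard library only) of how the two values above relate. It assumes the number in the error message is microseconds since the Unix epoch and that the older behaviour is a plain multiplication by 1000 that wraps around in a signed 64-bit integer. The helper `wrap_int64` is hypothetical and only illustrates the arithmetic; it is not taken from pyarrow, and the sketch does not claim to identify the actual root cause.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Value from the ArrowInvalid message, assumed to be microseconds since the epoch.
stored_us = -62135596800000000

# Interpreted as microseconds, the value is the first representable datetime.
as_datetime = EPOCH + timedelta(microseconds=stored_us)

# timestamp[ns] is backed by int64, so the nanosecond value must fit in 64 bits.
fits_in_int64 = -2**63 <= stored_us * 1000 < 2**63


def wrap_int64(value):
    """Hypothetical helper: wrap a Python int to a signed 64-bit integer."""
    return (value + 2**63) % 2**64 - 2**63


# Unchecked us -> ns conversion with 64-bit wraparound.
wrapped_ns = wrap_int64(stored_us * 1000)
seconds, nanos = divmod(wrapped_ns, 10**9)
wrapped = EPOCH + timedelta(seconds=seconds)

print(as_datetime)                # 0001-01-01 00:00:00
print(fits_in_int64)              # False
print(f"{wrapped}.{nanos:09d}")   # 1754-08-30 22:43:41.128654848
```

Under these assumptions, the NaTs appear to be stored as 0001-01-01 00:00:00, which is out of range for timestamp[ns]. The wrapped value matches what pyarrow 0.16.0 returns, whereas 0.17.x raises instead.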



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
