[
https://issues.apache.org/jira/browse/ARROW-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104355#comment-17104355
]
Tanguy Fautre commented on ARROW-5359:
--------------------------------------
I suspect this feature is needed to support Parquet files containing timestamps
in ms or us, where entries such as 0001-01-01 00:00 or 9999-12-31 23:59 need to
be supported (in our use case, these are MinValue and MaxValue of DateTime in
C#).
The following code works until {{to_pandas()}} is called. That call tries to
convert timestamp[ms] to timestamp[ns] (hence the {{safe=False}}), which
overflows for out-of-range values: 0001-01-01 comes back as 1754-08-30.
- Python 3.8.2 x64
- Pandas 1.0.3
- PyArrow 0.17.0
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Millisecond-resolution timestamps, including one far outside the ns range
df = pd.DataFrame({
    'id': [1, 2, 3],
    'dateTime': [np.datetime64('0001-01-01 00:00', 'ms'),
                 np.datetime64('2012-05-02 12:35', 'ms'),
                 np.datetime64('2012-05-03 15:42', 'ms')],
    'value': [1.1, 2.2, 3.3]})

# Round-trip through Parquet; the column stays timestamp[ms]
table = pa.Table.from_pandas(df)
pq.write_table(table, 'timeseries.parquet')
result = pq.read_table('timeseries.parquet')

# The ms -> ns conversion happens here
df2 = result.to_pandas(date_as_object=True, safe=False)
{code}
df
{code}
   id                 dateTime  value
0   1  0001-01-01T00:00:00.000    1.1
1   2  2012-05-02T12:35:00.000    2.2
2   3  2012-05-03T15:42:00.000    3.3
{code}
df['dateTime']
{code}
0 0001-01-01T00:00:00.000
1 2012-05-02T12:35:00.000
2 2012-05-03T15:42:00.000
Name: dateTime, dtype: object
{code}
table
{code}
pyarrow.Table
id: int64
dateTime: timestamp[ms]
value: double
{code}
result
{code}
pyarrow.Table
id: int64
dateTime: timestamp[ms]
value: double
{code}
df2
{code}
   id                      dateTime  value
0   1 1754-08-30 22:43:41.128654848    1.1
1   2 2012-05-02 12:35:00.000000000    2.2
2   3 2012-05-03 15:42:00.000000000    3.3
{code}
df2['dateTime']
{code}
0 1754-08-30 22:43:41.128654848
1 2012-05-02 12:35:00.000000000
2 2012-05-03 15:42:00.000000000
Name: dateTime, dtype: datetime64[ns]
{code}
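The 1754-08-30 value in df2 is consistent with a 64-bit integer overflow during the ms to ns conversion. The sketch below reproduces the arithmetic with the standard library only. It assumes the unchecked conversion behaves like a wrapping int64 multiplication by 1,000,000; that is an assumption about the mechanism, not a statement about how PyArrow implements the cast. The names {{EPOCH}}, {{ms_value}}, {{exact_ns}} and {{wrapped_ns}} are illustrative.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Milliseconds since the Unix epoch for 0001-01-01 00:00 (fits in int64).
ms_value = (datetime(1, 1, 1) - EPOCH) // timedelta(milliseconds=1)

# Exact nanosecond count, computed with Python's unbounded integers.
exact_ns = ms_value * 1_000_000

# Does the exact value fit in a signed 64-bit integer?
fits_in_int64 = -2**63 <= exact_ns <= 2**63 - 1

# Emulate two's-complement int64 wraparound (assumed behaviour of the
# unchecked conversion; not taken from the PyArrow source).
wrapped_ns = (exact_ns + 2**63) % 2**64 - 2**63

# datetime only has microsecond resolution, so the last three digits
# of the nanosecond value are dropped here.
wrapped_dt = EPOCH + timedelta(microseconds=wrapped_ns // 1000)

print(ms_value)       # milliseconds since epoch
print(fits_in_int64)  # whether the ns value is representable
print(wrapped_dt)     # the date the wrapped value decodes to
```

Under that assumption the wrapped value decodes to 1754-08-30 22:43:41.128654, which agrees with the first row of df2 up to the truncated nanosecond digits.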
> [Python] timestamp_as_object support for pa.Table.to_pandas in pyarrow
> ----------------------------------------------------------------------
>
> Key: ARROW-5359
> URL: https://issues.apache.org/jira/browse/ARROW-5359
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Environment: Ubuntu
> Reporter: Joe Muruganandam
> Priority: Major
>
> Creating a ticket for the issue reported on
> GitHub (https://github.com/apache/arrow/issues/4284).
> h2. pyarrow (Issue with timestamp conversion from arrow to pandas)
> pyarrow Table.to_pandas has a date_as_object option but no similar option
> for timestamps. When a timestamp column in an Arrow table is converted to
> pandas, the target data type is pd.Timestamp, which cannot represent times
> later than 2262-04-11 23:47:16.854775807, so in the scenario below the dates
> are transformed to incorrect values. Adding a timestamp_as_object option to
> pa.Table.to_pandas would help in this scenario.
> # Python 3.6.8
> import pandas as pd
> import pyarrow as pa
> pd.__version__
> '0.24.1'
> pa.__version__
> '0.13.0'
> import datetime
> df = pd.DataFrame({"test_date":
> [datetime.datetime(3000,12,31,12,0),datetime.datetime(3100,12,31,12,0)]})
> df
> test_date
> 0 3000-12-31 12:00:00
> 1 3100-12-31 12:00:00
> pa_table = pa.Table.from_pandas(df)
> pa_table[0]
> Column name='test_date' type=TimestampType(timestamp[us])
> [
> [
> 32535172800000000,
> 35690846400000000
> ]
> ]
> pa_table.to_pandas()
> test_date
> 0 1831-11-22 12:50:52.580896768
> 1 1931-11-22 12:50:52.580896768
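To illustrate why an object-based representation avoids the problem described in the issue, here is a minimal standard-library sketch. It decodes the two raw timestamp[us] values printed for pa_table[0] into plain datetime objects and compares them with the largest instant an int64 nanosecond count can hold. It does not call PyArrow or pandas; {{EPOCH}}, {{raw_us}}, {{as_objects}} and {{ns_limit}} are illustrative names, not part of either library.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1

# Raw microsecond values shown for pa_table[0] in the issue description.
raw_us = [32535172800000000, 35690846400000000]

# Plain datetime objects cover years 1 to 9999, so decoding the
# microsecond counts directly loses nothing.
as_objects = [EPOCH + timedelta(microseconds=v) for v in raw_us]

# Largest instant representable as int64 nanoseconds since the epoch,
# truncated to datetime's microsecond resolution.
ns_limit = EPOCH + timedelta(microseconds=INT64_MAX // 1000)

# A value overflows when its nanosecond count exceeds INT64_MAX.
overflows = [v * 1000 > INT64_MAX for v in raw_us]

for value, bad in zip(as_objects, overflows):
    print(value, "exceeds ns range" if bad else "fits in ns range")
print("ns limit:", ns_limit)
```

Both values decode to the dates originally stored in the DataFrame and both lie beyond the nanosecond limit, which is why a nanosecond-based target type cannot hold them while Python objects can.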