Joris Van den Bossche created ARROW-18124:
---------------------------------------------
Summary: [Python] Support converting to non-nano datetime64 for pandas >= 2.0
Key: ARROW-18124
URL: https://issues.apache.org/jira/browse/ARROW-18124
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche
Fix For: 11.0.0
Pandas is adding capabilities to store non-nanosecond datetime64 data. At the
moment, however, we always convert to nanosecond resolution, regardless of the
timestamp resolution of the Arrow table (and regardless of the pandas metadata).
Using the development version of pandas:
{code}
In [1]: df = pd.DataFrame({"col": np.arange("2012-01-01", 10, dtype="datetime64[s]")})
In [2]: df.dtypes
Out[2]:
col datetime64[s]
dtype: object
In [3]: table = pa.table(df)
In [4]: table.schema
Out[4]:
col: timestamp[s]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 423
In [6]: table.to_pandas().dtypes
Out[6]:
col datetime64[ns]
dtype: object
{code}
This is because we have a {{coerce_temporal_nanoseconds}} conversion option
which we hardcode to True for top-level columns (and to False for nested data).
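As a rough sketch, if that internal option were exposed as a keyword on
{{to_pandas()}} (the keyword name below is hypothetical and simply mirrors the
internal option), users could opt out of the coercion explicitly:
{code}
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {"col": np.array(["2012-01-01", "2012-01-02"], dtype="datetime64[s]")}
)
table = pa.table(df)  # schema has col: timestamp[s]

# current behaviour: coerce_temporal_nanoseconds is hardcoded to True for
# top-level columns, so this always returns datetime64[ns]
table.to_pandas()

# hypothetical keyword mirroring the internal option: with pandas >= 2.0
# this would keep the second resolution (datetime64[s])
table.to_pandas(coerce_temporal_nanoseconds=False)
{code}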
When users have pandas >= 2, we should support converting while preserving the
resolution. We should certainly do so when the pandas metadata indicates which
resolution was originally used (to ensure a correct roundtrip).
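Concretely, the roundtrip that should be preserved looks like the sketch below;
the expected dtypes assume pandas >= 2.0 and the behaviour proposed here, not
the current release:
{code}
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {"col": np.array(["2012-01-01", "2012-01-02"], dtype="datetime64[s]")}
)
table = pa.table(df)

# the pandas metadata stored in the schema records the pandas dtype each
# column originally had, so the starting resolution is known
print(table.schema.pandas_metadata["columns"])

roundtripped = table.to_pandas()
# today:    roundtripped.dtypes -> col datetime64[ns]
# proposed: use the recorded dtype so the DataFrame -> Arrow -> DataFrame
#           roundtrip gives back datetime64[s] with pandas >= 2.0
{code}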
We _could_ (and at some point also _should_) do that by default as well when
there is no pandas metadata. That may have to wait, though, depending on how
stable this new feature is in pandas, as it is potentially a breaking change
for users who, for example, use pyarrow to read a Parquet file.
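For the no-metadata case, a small sketch of where a changed default would be
visible to users (the file name is only illustrative):
{code}
import pyarrow as pa
import pyarrow.parquet as pq

# a table created directly in Arrow, so it carries no pandas metadata
table = pa.table({"col": pa.array([0, 1000, 2000], type=pa.timestamp("ms"))})
pq.write_table(table, "example.parquet")  # illustrative file name

result = pq.read_table("example.parquet").to_pandas()

# today:    result["col"].dtype is datetime64[ns]
# proposed default (with pandas >= 2.0): it would become datetime64[ms],
# which is the potentially breaking change for existing users noted above
{code}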