jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223272114
##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
_Type_HALF_FLOAT: np.float16,
_Type_FLOAT: np.float32,
_Type_DOUBLE: np.float64,
- _Type_DATE32: np.dtype('datetime64[ns]'),
- _Type_DATE64: np.dtype('datetime64[ns]'),
- _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
- _Type_DURATION: np.dtype('timedelta64[ns]'),
+ _Type_DATE32: np.dtype('datetime64[D]'),
Review Comment:
I think those test failures are related to the fact that, with our default settings, parquet doesn't support nanosecond timestamps, and we don't actually try to preserve the original unit when roundtripping arrow<->parquet:
```
In [1]: import pyarrow as pa
In [2]: import pyarrow.parquet as pq
In [3]: table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("s")).cast(pa.timestamp("ns"))})
In [4]: pq.write_table(table, "test_nanoseconds.parquet")
In [5]: pq.read_table("test_nanoseconds.parquet")
Out[5]:
pyarrow.Table
col: timestamp[us]
----
col: [[1970-01-01 00:00:01.000000,1970-01-01 00:00:02.000000,1970-01-01 00:00:03.000000]]
```
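(Side note, and just a sketch under the assumption that the default parquet format version used here is older than 2.6: if we explicitly write with `version="2.6"`, which has a NANOS timestamp logical type, the nanosecond unit does survive the parquet roundtrip. The file name is just illustrative:)
```
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("ns"))})

# Parquet format version 2.6 supports TIMESTAMP(NANOS), so no coercion to
# microseconds is needed on write.
pq.write_table(table, "test_ns_v26.parquet", version="2.6")
print(pq.read_table("test_ns_v26.parquet").schema)  # col: timestamp[ns]
```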
So starting from an arrow table with nanoseconds, the result after the roundtrip has microseconds. We actually _could_ preserve the original unit, because we store the original arrow schema in the parquet metadata, although restoring the nanosecond unit would not be a zero-copy restoration, in contrast to for example restoring the timezone, or restoring duration from int64, which is done in `ApplyOriginalStorageMetadata`.
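(To show what I mean by the stored schema, a small check against the file written above; the serialized original Arrow schema is kept under the `ARROW:schema` key of the file's key-value metadata when `store_schema=True`, which is the default:)
```
import pyarrow.parquet as pq

# The original Arrow schema is serialized into the Parquet key-value
# metadata under the "ARROW:schema" key.
kv_meta = pq.read_metadata("test_nanoseconds.parquet").metadata
print(b"ARROW:schema" in kv_meta)  # True
```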
So this means that whenever we start with nanoseconds, we get back microseconds after a roundtrip to parquet. And if the roundtrip actually started from pandas with nanoseconds, we now also get microseconds in the pandas result, whereas before we still got nanoseconds, since we forced that unit in the arrow->pandas conversion step.
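(And a sketch of the pandas roundtrip described here; which dtype actually comes back depends on the pyarrow/pandas versions and on whether the forced ns coercion removed in this PR is still in place, so the comments below are only illustrative:)
```
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": pd.to_datetime(["2023-01-01", "2023-01-02"])})
print(df["col"].dtype)  # datetime64[ns]

pq.write_table(pa.Table.from_pandas(df), "roundtrip.parquet")
result = pq.read_table("roundtrip.parquet").to_pandas()

# Before: always datetime64[ns], because arrow->pandas forced nanoseconds.
# Without that coercion: the parquet-level microseconds show through
# (datetime64[us], which requires pandas >= 2.0 to represent).
print(result["col"].dtype)
```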