jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223272114
##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
_Type_HALF_FLOAT: np.float16,
_Type_FLOAT: np.float32,
_Type_DOUBLE: np.float64,
- _Type_DATE32: np.dtype('datetime64[ns]'),
- _Type_DATE64: np.dtype('datetime64[ns]'),
- _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
- _Type_DURATION: np.dtype('timedelta64[ns]'),
+ _Type_DATE32: np.dtype('datetime64[D]'),
Review Comment:
I think those test failures are related to the fact that, with our default settings, parquet doesn't support nanosecond timestamps, and we don't actually try to preserve the original unit when roundtripping arrow<->parquet:
```
In [1]: import pyarrow as pa
In [2]: import pyarrow.parquet as pq
In [3]: table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("s")).cast(pa.timestamp("ns"))})
In [4]: pq.write_table(table, "test_nanoseconds.parquet")
In [5]: pq.read_table("test_nanoseconds.parquet")
Out[5]:
pyarrow.Table
col: timestamp[us]
----
col: [[1970-01-01 00:00:01.000000,1970-01-01 00:00:02.000000,1970-01-01 00:00:03.000000]]
```
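(Side note, and just a sketch under the assumption that the default parquet format version used here is older than 2.6: if we explicitly write with `version="2.6"`, which has a NANOS timestamp logical type, the nanosecond unit does survive the parquet roundtrip. The file name is just illustrative:)
```
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("ns"))})

# Parquet format version 2.6 supports TIMESTAMP(NANOS), so no coercion to
# microseconds is needed on write.
pq.write_table(table, "test_ns_v26.parquet", version="2.6")
print(pq.read_table("test_ns_v26.parquet").schema)  # col: timestamp[ns]
```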
So starting from an arrow table with nanoseconds, the result after the roundtrip has microseconds. We actually _could_ preserve the original unit, because we store the original arrow schema in the parquet metadata, although restoring the nanosecond unit would not be a zero-copy restoration, in contrast to for example restoring the timezone, or restoring duration from int64, which is done in `ApplyOriginalStorageMetadata`.
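(To show what I mean by the stored schema, a small check against the file written above; the serialized original Arrow schema is kept under the `ARROW:schema` key of the file's key-value metadata when `store_schema=True`, which is the default:)
```
import pyarrow.parquet as pq

# The original Arrow schema is serialized into the Parquet key-value
# metadata under the "ARROW:schema" key.
kv_meta = pq.read_metadata("test_nanoseconds.parquet").metadata
print(b"ARROW:schema" in kv_meta)  # True
```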
So this means that whenever we start with nanoseconds, we get back microseconds after a roundtrip to parquet. And if the roundtrip actually started from pandas with nanoseconds, we now also get microseconds in the pandas result, whereas before we still got nanoseconds, since we forced that unit in the arrow->pandas conversion step.
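(And a sketch of the pandas roundtrip described here; which dtype actually comes back depends on the pyarrow/pandas versions and on whether the forced ns coercion removed in this PR is still in place, so the comments below are only illustrative:)
```
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": pd.to_datetime(["2023-01-01", "2023-01-02"])})
print(df["col"].dtype)  # datetime64[ns]

pq.write_table(pa.Table.from_pandas(df), "roundtrip.parquet")
result = pq.read_table("roundtrip.parquet").to_pandas()

# Before: always datetime64[ns], because arrow->pandas forced nanoseconds.
# Without that coercion: the parquet-level microseconds show through
# (datetime64[us], which requires pandas >= 2.0 to represent).
print(result["col"].dtype)
```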