jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1230607268
##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -2060,9 +2097,10 @@ static Status GetPandasWriterType(const ChunkedArray& data, const PandasOptions&
case Type::DATE64:
if (options.date_as_object) {
*output_type = PandasWriter::OBJECT;
+ } else if (options.coerce_temporal_nanoseconds) {
+ *output_type = PandasWriter::DATETIME_NANO;
} else {
- *output_type = options.coerce_temporal_nanoseconds ? PandasWriter::DATETIME_NANO
-                                                    : PandasWriter::DATETIME_DAY;
+ *output_type = PandasWriter::DATETIME_MILLI;
Review Comment:
I am just realizing that this is a small regression / behaviour change for
`to_numpy`, which also goes through this code path. NumPy does have
`datetime64[D]`, so here it can make sense to actually use the "D" unit when
converting to a numpy array.
I am not sure if we want another option to indicate whether we are
converting to numpy or to pandas, though.
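For reference, the distinction at play can be sketched with plain NumPy/pandas (illustrative only, not part of this diff): NumPy has a genuine day resolution, while pandas cannot store it and upcasts to a finer unit.

```python
import numpy as np
import pandas as pd

# NumPy supports a day ("D") resolution for datetime64 arrays
days = np.array(['2000-01-01', '2040-02-26'], dtype='datetime64[D]')
assert days.dtype == np.dtype('datetime64[D]')

# pandas only supports s/ms/us/ns resolutions (and only ns before 2.0),
# so wrapping the same data in a Series upcasts it to a finer unit
s = pd.Series(days)
assert str(s.dtype).startswith('datetime64')
assert s.dtype != days.dtype  # never stored as datetime64[D]
```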
##########
python/pyarrow/tests/test_array.py:
##########
@@ -212,8 +212,9 @@ def test_to_numpy_writable():
@pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
-def test_to_numpy_datetime64(unit):
- arr = pa.array([1, 2, 3], pa.timestamp(unit))
[email protected]('tz', [None, "UTC"])
+def test_to_numpy_datetime64(unit, tz):
+ arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
np_arr = arr.to_numpy()
np.testing.assert_array_equal(np_arr, expected)
Review Comment:
Another comment related to this file (but not this specific test): we are
testing `to_numpy` here, but we should also cover the fix to `Array.to_pandas` so
that it no longer coerces to nanoseconds (since that goes through a different
code path at the Cython level compared to `Table.to_pandas`).
So maybe either repeat this test here with `to_pandas()`, or add one below
where there is already a `test_to_pandas_timezone` test function.
##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1016,13 +1020,17 @@ def test_timestamps_notimezone_nulls(self):
expected_schema=schema,
)
- def test_timestamps_with_timezone(self):
+ @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
+ def test_timestamps_with_timezone(self, unit):
+ if Version(pd.__version__) < Version("2.0.0"):
+ # ARROW-3789: Coerce date/timestamp types to datetime64[ns]
Review Comment:
I suppose this reference to ARROW-3789 was copied from another comment, but
I think here it's not very useful
##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1452,26 +1487,27 @@ def test_timestamp_to_pandas_out_of_bounds(self):
msg = "would result in out of bounds timestamp"
with pytest.raises(ValueError, match=msg):
- arr.to_pandas()
+ print(arr.to_pandas(coerce_temporal_nanoseconds=True))
Review Comment:
```suggestion
arr.to_pandas(coerce_temporal_nanoseconds=True)
```
##########
python/pyarrow/tests/test_schema.py:
##########
@@ -45,7 +46,10 @@ def test_type_integers():
def test_type_to_pandas_dtype():
- M8_ns = np.dtype('datetime64[ns]')
+ M8 = np.dtype('datetime64[ms]')
+ if _pandas_api.is_v1():
Review Comment:
In tests we generally use the actual pandas version, like
`Version(pd.__version__) < Version("2.0")` (although this is more convenient in
this case, so it's probably fine to keep it here)
##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
table = pa.Table.from_pandas(df, preserve_index=False)
- df_datetime = table.to_pandas(date_as_object=False)
+ df_datetime = table.to_pandas(date_as_object=False,
+ coerce_temporal_nanoseconds=coerce_to_ns)
df_object = table.to_pandas()
- tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+ tm.assert_frame_equal(df.astype(expected_type), df_datetime,
check_dtype=True)
tm.assert_frame_equal(df, df_object, check_dtype=True)
+ def test_table_coerce_temporal_nanoseconds(self):
+ df = pd.DataFrame({'date': [date(2000, 1, 1)]}, dtype='datetime64[ms]')
+ table = pa.Table.from_pandas(df)
+ result_df = table.to_pandas(
+ coerce_temporal_nanoseconds=True, date_as_object=False)
+ expected_df = df.astype('datetime64[ns]')
+ tm.assert_frame_equal(result_df, expected_df)
+ result_df = table.to_pandas(
+ coerce_temporal_nanoseconds=False, date_as_object=False)
Review Comment:
```suggestion
result_df = table.to_pandas(coerce_temporal_nanoseconds=False)
```
##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
table = pa.Table.from_pandas(df, preserve_index=False)
- df_datetime = table.to_pandas(date_as_object=False)
+ df_datetime = table.to_pandas(date_as_object=False,
+ coerce_temporal_nanoseconds=coerce_to_ns)
df_object = table.to_pandas()
- tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+ tm.assert_frame_equal(df.astype(expected_type), df_datetime,
check_dtype=True)
tm.assert_frame_equal(df, df_object, check_dtype=True)
+ def test_table_coerce_temporal_nanoseconds(self):
+ df = pd.DataFrame({'date': [date(2000, 1, 1)]}, dtype='datetime64[ms]')
+ table = pa.Table.from_pandas(df)
+ result_df = table.to_pandas(
+ coerce_temporal_nanoseconds=True, date_as_object=False)
Review Comment:
```suggestion
result_df = table.to_pandas(coerce_temporal_nanoseconds=True)
```
We have timestamp types here, so `date_as_object` has no effect (and seeing
it there can only confuse the reader).
##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1199,13 +1223,25 @@ def test_table_convert_date_as_object(self):
table = pa.Table.from_pandas(df, preserve_index=False)
- df_datetime = table.to_pandas(date_as_object=False)
+ df_datetime = table.to_pandas(date_as_object=False,
+ coerce_temporal_nanoseconds=coerce_to_ns)
df_object = table.to_pandas()
- tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+ tm.assert_frame_equal(df.astype(expected_type), df_datetime,
check_dtype=True)
tm.assert_frame_equal(df, df_object, check_dtype=True)
+ def test_table_coerce_temporal_nanoseconds(self):
Review Comment:
Can you add a similar "test_array_coerce_temporal_nanoseconds"?
##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -67,10 +67,14 @@ def _alltypes_example(size=100):
'float32': np.arange(size, dtype=np.float32),
'float64': np.arange(size, dtype=np.float64),
'bool': np.random.randn(size) > 0,
- # TODO(wesm): Pandas only support ns resolution, Arrow supports s, ms,
- # us, ns
- 'datetime': np.arange("2016-01-01T00:00:00.001", size,
- dtype='datetime64[ms]').astype("datetime64[ns]"),
+ 'datetime[s]': np.arange("2016-01-01T00:00:00.001", size,
+ dtype='datetime64[s]'),
+ 'datetime[ms]': np.arange("2016-01-01T00:00:00.001", size,
+ dtype='datetime64[ms]'),
+ 'datetime[us]': np.arange("2016-01-01T00:00:00.001", size,
+ dtype='datetime64[us]'),
+ 'datetime[ns]': np.arange("2016-01-01T00:00:00.001", size,
+ dtype='datetime64[ns]'),
Review Comment:
I know this was not yet there before, but can you also add timedelta
columns? Because also for timedelta64 (duration on the pyarrow side), pandas
now supports the multiple resolutions and we now preserve the resolution on
conversion to pandas (with pandas >= 2.0).
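The timedelta columns could follow the same pattern as the datetime ones in the diff above; a numpy-only sketch of what could be added to `_alltypes_example`:

```python
import numpy as np

size = 100
# One column per resolution, mirroring the 'datetime[<unit>]' columns above
timedelta_cols = {
    'timedelta[{}]'.format(unit): np.arange(size, dtype='timedelta64[{}]'.format(unit))
    for unit in ('s', 'ms', 'us', 'ns')
}
# All columns are timedelta64 ('m' kind), each keeping its own unit
assert all(col.dtype.kind == 'm' for col in timedelta_cols.values())
```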
##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1171,26 +1187,34 @@ def test_array_types_date_as_object(self):
None,
date(1970, 1, 1),
date(2040, 2, 26)]
- expected_d = np.array(['2000-01-01', None, '1970-01-01',
- '2040-02-26'], dtype='datetime64[D]')
+ expected_days = np.array(['2000-01-01', None, '1970-01-01',
+ '2040-02-26'], dtype='datetime64[D]')
- expected_ns = np.array(['2000-01-01', None, '1970-01-01',
- '2040-02-26'], dtype='datetime64[ns]')
+ expected_dtype = 'datetime64[ms]'
+ if Version(pd.__version__) < Version("2.0.0"):
+ # ARROW-3789: Coerce date/timestamp types to datetime64[ns]
+ expected_dtype = 'datetime64[ns]'
+
+ expected = np.array(['2000-01-01', None, '1970-01-01',
+ '2040-02-26'], dtype=expected_dtype)
objects = [pa.array(data),
pa.chunked_array([data])]
for obj in objects:
result = obj.to_pandas()
- expected_obj = expected_d.astype(object)
+ expected_obj = expected_days.astype(object)
assert result.dtype == expected_obj.dtype
npt.assert_array_equal(result, expected_obj)
result = obj.to_pandas(date_as_object=False)
- assert result.dtype == expected_ns.dtype
- npt.assert_array_equal(result, expected_ns)
+ assert result.dtype == expected.dtype
+ npt.assert_array_equal(result, expected)
- def test_table_convert_date_as_object(self):
+ @pytest.mark.parametrize("coerce_to_ns,expected_type",
+ [(False, 'datetime64[ms]'),
+ (True, 'datetime64[ns]')])
Review Comment:
Can you add this parametrization also to the test above? (for
array.to_pandas)
##########
python/pyarrow/types.pxi:
##########
@@ -23,6 +23,8 @@ import re
import sys
import warnings
+from pyarrow.lib import _pandas_api
Review Comment:
I think this import shouldn't be needed, as we essentially are "in"
pyarrow.lib? (types.pxi gets included in lib.pyx)
##########
python/pyarrow/tests/test_array.py:
##########
@@ -212,8 +212,9 @@ def test_to_numpy_writable():
@pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
-def test_to_numpy_datetime64(unit):
- arr = pa.array([1, 2, 3], pa.timestamp(unit))
[email protected]('tz', [None, "UTC"])
+def test_to_numpy_datetime64(unit, tz):
+ arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
np_arr = arr.to_numpy()
np.testing.assert_array_equal(np_arr, expected)
Review Comment:
Although, maybe you already covered this in test_pandas.py? While many of
the tests in `TestConvertDateTimeLikeTypes` use a table for the roundtrip
testing, I see that there are also a bunch that use arrays towards the end
of the test class.
##########
python/pyarrow/types.pxi:
##########
@@ -115,6 +128,21 @@ def _is_primitive(Type type):
return is_primitive(type)
+def _get_pandas_type(type, unit=None):
+ if type not in _pandas_type_map:
+ return None
+ if type in [_Type_DATE32, _Type_DATE64, _Type_TIMESTAMP, _Type_DURATION]:
+ if _pandas_api.is_v1():
Review Comment:
Nitpick: maybe switch the two checks (assuming the is_v1 is cheaper), or on
a single line as `if _pandas_api.is_v1() and type in ..` (although it probably
doesn't fit on a single line anyway)
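The ordering matters because `and` short-circuits; a toy illustration (illustrative only, not pyarrow code):

```python
calls = []

def cheap_check():
    # stands in for the inexpensive condition, e.g. a cached version flag
    calls.append('cheap')
    return False

def expensive_check():
    # stands in for the costlier condition, e.g. a membership test
    calls.append('expensive')
    return True

# With the cheap check first, `and` never evaluates the expensive one
# when the cheap one is already False
result = cheap_check() and expensive_check()
assert result is False
assert calls == ['cheap']
```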
##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1437,13 +1478,7 @@ def test_timestamp_to_pandas_ns(self):
def test_timestamp_to_pandas_out_of_bounds(self):
# ARROW-7758 check for out of bounds timestamps for non-ns timestamps
-
- if Version(pd.__version__) >= Version("2.1.0.dev"):
- # GH-35235: test fail due to __from_pyarrow__ being added to pandas
- # https://github.com/pandas-dev/pandas/pull/52201
- # Needs: https://github.com/apache/arrow/issues/33321
- pytest.skip(
"Need support converting to non-nano datetime64 for pandas >= 2.0")
Review Comment:
So this still seems to fail on pandas nightly
(https://github.com/apache/arrow/pull/35656#issuecomment-1592881853)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]