danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223418022


##########
python/pyarrow/src/arrow/python/arrow_to_pandas.cc:
##########
@@ -1618,31 +1645,37 @@ class DatetimeNanoWriter : public DatetimeWriter<TimeUnit::NANO> {
   }
 };
 
-class DatetimeTZWriter : public DatetimeNanoWriter {
- public:
-  DatetimeTZWriter(const PandasOptions& options, const std::string& timezone,
-                   int64_t num_rows)
-      : DatetimeNanoWriter(options, num_rows, 1), timezone_(timezone) {}
-
- protected:
-  Status GetResultBlock(PyObject** out) override {
-    RETURN_NOT_OK(MakeBlock1D());
-    *out = block_arr_.obj();
-    return Status::OK();
-  }
-
-  Status AddResultMetadata(PyObject* result) override {
-    PyObject* py_tz = PyUnicode_FromStringAndSize(
-        timezone_.c_str(), static_cast<Py_ssize_t>(timezone_.size()));
-    RETURN_IF_PYERROR();
-    PyDict_SetItemString(result, "timezone", py_tz);
-    Py_DECREF(py_tz);
-    return Status::OK();
-  }
+// TODO (do not merge) how to templatize this..

Review Comment:
   I have a TODO to try something better than a #define before merging



##########
python/pyarrow/tests/parquet/common.py:
##########
@@ -176,8 +176,8 @@ def alltypes_sample(size=10000, seed=0, categorical=False):
         # TODO(wesm): Test other timestamp resolutions now that arrow supports
         # them
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype('datetime64[ns]'),
-        'timedelta': np.arange(0, size, dtype="timedelta64[ns]"),
+                              dtype='datetime64[ms]'),
+        'timedelta': np.arange(0, size, dtype="timedelta64[s]"),

Review Comment:
   Reverting https://github.com/apache/arrow/pull/14460



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1329,7 +1330,7 @@ def _test_write_to_dataset_no_partitions(base_path,
                               'num': list(range(10)),
                               'date': np.arange('2017-01-01', '2017-01-11',
                                                 dtype='datetime64[D]')})
-    output_df["date"] = output_df["date"].astype('datetime64[ns]')
+    output_df["date"] = output_df["date"].astype('datetime64[ms]')

Review Comment:
   Modifying https://github.com/apache/arrow/pull/14460 to use `ms`, which is the new default conversion for `datetime64[D]` (days).
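   The `[D]`-to-`ms` relationship can be sketched with plain NumPy, using the same date range as the test:

   ```python
   import numpy as np

   # Ten consecutive days, as in the test fixture
   dates = np.arange('2017-01-01', '2017-01-11', dtype='datetime64[D]')
   as_ms = dates.astype('datetime64[ms]')

   assert as_ms.dtype == np.dtype('datetime64[ms]')
   # The cast only refines the unit; the instants themselves are unchanged
   assert (as_ms == dates).all()
   ```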



##########
python/pyarrow/tests/parquet/test_datetime.py:
##########
@@ -153,7 +154,7 @@ def test_coerce_timestamps_truncated(tempdir):
     df_ms = table_ms.to_pandas()
 
     arrays_expected = {'datetime64': [dt_ms, dt_ms]}
-    df_expected = pd.DataFrame(arrays_expected)
+    df_expected = pd.DataFrame(arrays_expected, dtype='datetime64[ms]')

Review Comment:
   Need to specify the expected dtype to match the coercion now.



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1105,7 +1113,8 @@ class MyDatetime(datetime):
         assert isinstance(table[0].chunk(0), pa.TimestampArray)
 
         result = table.to_pandas()
-        expected_df = pd.DataFrame({"datetime": date_array})
+        expected_df = pd.DataFrame(
+            {"datetime": pd.Series(date_array, dtype='datetime64[us]')})

Review Comment:
   Again, convert pandas default `ns` to Arrow default `us`



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1202,7 +1211,7 @@ def test_table_convert_date_as_object(self):
         df_datetime = table.to_pandas(date_as_object=False)
         df_object = table.to_pandas()
 
-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype('datetime64[ms]'), df_datetime,

Review Comment:
   Not yet, will add!



##########
python/pyarrow/tests/parquet/test_pandas.py:
##########
@@ -344,7 +344,7 @@ def test_index_column_name_duplicate(tempdir, use_legacy_dataset):
         }
     }
     path = str(tempdir / 'data.parquet')
-    dfx = pd.DataFrame(data).set_index('time', drop=False)
+    dfx = pd.DataFrame(data, dtype='datetime64[us]').set_index('time', drop=False)

Review Comment:
   Pandas defaults to `ns`, but Arrow defaults to `us`
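   A minimal sketch of the unit mismatch, assuming pandas >= 2.0 (the first pandas release to support non-nanosecond datetime units):

   ```python
   import datetime
   import pandas as pd

   # Without an explicit dtype, the pandas constructor has historically
   # defaulted to nanosecond resolution ('datetime64[ns]').
   # Matching Arrow's default microsecond unit requires spelling it out:
   s_us = pd.Series([datetime.datetime(2017, 9, 6)], dtype='datetime64[us]')
   assert str(s_us.dtype) == 'datetime64[us]'
   ```

   With the explicit `datetime64[us]` dtype, the expected frame lines up with what `Table.to_pandas()` now produces by default.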



##########
python/pyarrow/tests/parquet/test_datetime.py:
##########
@@ -64,12 +64,13 @@ def test_pandas_parquet_datetime_tz(use_legacy_dataset):
 
     arrow_table = pa.Table.from_pandas(df)
 
-    _write_table(arrow_table, f, coerce_timestamps='ms')
+    _write_table(arrow_table, f)

Review Comment:
   We don't need to coerce anything here. In pandas 1.x it gets coerced to `ns` anyway; now in 2.0, it just causes an error.



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1458,7 +1459,6 @@ def test_write_to_dataset_with_partitions_and_custom_filenames(
                               'nan': [np.nan] * 10,
                               'date': np.arange('2017-01-01', '2017-01-11',
                                                 dtype='datetime64[D]')})
-    output_df["date"] = output_df["date"].astype('datetime64[ns]')

Review Comment:
   Reverting https://github.com/apache/arrow/pull/14460 - no longer needed



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1471,7 +1487,7 @@ def test_timestamp_to_pandas_empty_chunked(self):
         # ARROW-7907 table with chunked array with 0 chunks
         table = pa.table({'a': pa.chunked_array([], type=pa.timestamp('us'))})
         result = table.to_pandas()
-        expected = pd.DataFrame({'a': pd.Series([], dtype="datetime64[ns]")})
+        expected = pd.DataFrame({'a': pd.Series([], dtype="datetime64[us]")})

Review Comment:
   This is the Arrow default time unit



##########
python/pyarrow/tests/parquet/test_datetime.py:
##########
@@ -50,7 +50,7 @@
 @pytest.mark.pandas
 @parametrize_legacy_dataset
 def test_pandas_parquet_datetime_tz(use_legacy_dataset):
-    s = pd.Series([datetime.datetime(2017, 9, 6)])
+    s = pd.Series([datetime.datetime(2017, 9, 6)], dtype='datetime64[us]')

Review Comment:
   `us` is the Arrow default, but `ns` is the pandas default if dtype is not 
specified.



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1268,7 +1268,7 @@ def _test_write_to_dataset_with_partitions(base_path,
                               'nan': [np.nan] * 10,
                               'date': np.arange('2017-01-01', '2017-01-11',
                                                 dtype='datetime64[D]')})
-    output_df["date"] = output_df["date"].astype('datetime64[ns]')
+    output_df["date"] = output_df["date"].astype('datetime64[ms]')

Review Comment:
   Modifying https://github.com/apache/arrow/pull/14460 to use `ms`, which is the new default conversion for `datetime64[D]` (days).



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1054,7 +1062,7 @@ def test_python_datetime(self):
 
         result = table.to_pandas()
         expected_df = pd.DataFrame({
-            'datetime': date_array
+            'datetime': pd.Series(date_array, dtype='datetime64[us]')

Review Comment:
   Pandas default is `ns`, but Arrow default is `us`



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -1438,7 +1454,7 @@ def test_timestamp_to_pandas_ns(self):
     def test_timestamp_to_pandas_out_of_bounds(self):
         # ARROW-7758 check for out of bounds timestamps for non-ns timestamps
 
-        if Version(pd.__version__) >= Version("2.1.0.dev"):
+        if Version(pd.__version__) < Version("2.1.0.dev"):

Review Comment:
   TODO: I need to verify (1) that this test works and (2) which version to actually run it on.



##########
python/pyarrow/tests/test_pandas.py:
##########
@@ -2850,7 +2866,7 @@ def test_strided_data_import(self):
         cases.append(boolean_objects)
 
         cases.append(np.arange("2016-01-01T00:00:00.001", N * K,
-                               dtype='datetime64[ms]').astype("datetime64[ns]")

Review Comment:
   Revert https://github.com/apache/arrow/pull/14460



##########
python/pyarrow/tests/parquet/common.py:
##########
@@ -176,8 +176,8 @@ def alltypes_sample(size=10000, seed=0, categorical=False):
         # TODO(wesm): Test other timestamp resolutions now that arrow supports
         # them
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype('datetime64[ns]'),
-        'timedelta': np.arange(0, size, dtype="timedelta64[ns]"),
+                              dtype='datetime64[ms]'),

Review Comment:
   Quickly enabling it does open up a small can of worms:
   ```
   FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[None-True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[None-False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[1000-True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[1000-False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   FAILED pyarrow/tests/parquet/test_metadata.py::test_parquet_metadata_api - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_metadata.py::test_compare_schemas - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_custom_metadata - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_column_multiindex[True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_column_multiindex[False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_2_0_roundtrip_read_pandas_no_index_written[True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_2_0_roundtrip_read_pandas_no_index_written[False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[300] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[1000] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[1300] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_reader[1000] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
   ```
   
   Maybe better for a follow-up PR?
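   The lossy cast behind most of those failures can be reproduced with plain NumPy, using the exact value from the error messages. Note that NumPy truncates silently where Arrow refuses and raises `ArrowInvalid`:

   ```python
   import numpy as np

   # The nanosecond value reported in the ArrowInvalid errors above
   t_ns = np.datetime64(1451606400001000001, 'ns')
   t_ms = t_ns.astype('datetime64[ms]')  # NumPy silently truncates

   # One nanosecond of sub-millisecond precision is lost in the round trip
   assert t_ns - t_ms.astype('datetime64[ns]') == np.timedelta64(1, 'ns')
   ```

   When writing Parquet, passing `allow_truncated_timestamps=True` alongside `coerce_timestamps='ms'` makes Arrow accept the precision loss instead of raising.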



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
