jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1255367993
##########
python/pyarrow/array.pxi:
##########
@@ -721,12 +722,16 @@ cdef class _PandasConvertible(_Weakrefable):
integer_object_nulls : bool, default False
Cast integers with nulls to objects
date_as_object : bool, default True
- Cast dates to objects. If False, convert to datetime64[ns] dtype.
+ Cast dates to objects. If False, convert to datetime64 dtype with
+ the equivalent time unit (if supported). Note: in pandas version
+ < 2.0, only datetime64[ns] conversion is supported.
+ dtype.
Review Comment:
```suggestion
< 2.0, only datetime64[ns] conversion is supported.
```
(leftover from previous iteration)
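For context, a minimal sketch of the documented behavior (using pyarrow's `to_pandas` with its `date_as_object` keyword; the exact resulting unit depends on the pandas version):
```python
import datetime
import pyarrow as pa

arr = pa.array([datetime.date(2023, 1, 1)])  # date32 type

# Default (date_as_object=True): dates become Python date objects.
print(arr.to_pandas().dtype)  # object

# With date_as_object=False: a datetime64 dtype. On pandas >= 2.0 this
# keeps an equivalent non-nanosecond unit; pandas < 2.0 only supports
# datetime64[ns], so it coerces to that.
print(arr.to_pandas(date_as_object=False).dtype)
```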
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
# Partitioned columns become 'categorical' dtypes
for col in partition_by:
output_df[col] = output_df[col].astype('category')
+
+ if schema:
+ expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+ else:
+ # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+ # silently coerce to [ns] due to non-[ns] support.
+ expected_date_type = 'datetime64[ms]'
Review Comment:
This comment is not fully correct, I think: when converting the pandas
DataFrame to pyarrow, we actually don't get a date32 type but a timestamp
type. But then I also don't understand how this test is passing ...
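To illustrate the type point (a quick check, not part of the PR; the column name is just an example):
```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"date": pd.to_datetime(["2023-01-01", "2023-01-02"])})
table = pa.Table.from_pandas(df)

# The column comes through as a timestamp type (e.g. timestamp[ns]),
# not date32, so to_pandas_dtype() reflects a timestamp resolution.
print(table.schema.field("date").type)
```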
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
# Partitioned columns become 'categorical' dtypes
for col in partition_by:
output_df[col] = output_df[col].astype('category')
+
+ if schema:
+ expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+ else:
+ # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+ # silently coerce to [ns] due to non-[ns] support.
+ expected_date_type = 'datetime64[ms]'
Review Comment:
So what actually happens with pandas 2.x: when we create a DataFrame with
datetime64[D], that gets converted to datetime64[s] (the closest supported
resolution to "D"). Roundtripping through Parquet then turns that into "ms"
(because "s" is not supported by Parquet).
With older pandas, the data gets converted to datetime64[ns], comes back from
Parquet as "us", and is converted back to "ns" when converting to pandas. The
`astype("datetime64[ms]")` then essentially does nothing, i.e. pandas
preserves the "ns" because it doesn't support "ms", and hence the test also
passes for older pandas.
Maybe it's simpler to just test with a DataFrame of nanoseconds, which now
works the same with old and new pandas; then we don't have to add any comment
or astype.
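A minimal sketch of the coercion described above (behavior differs by pandas version, as noted):
```python
import numpy as np
import pandas as pd

# pandas 2.x stores a datetime64[D] array at the closest supported
# resolution, "s"; pandas 1.x coerces everything to "ns".
s = pd.Series(np.array(["2023-01-01"], dtype="datetime64[D]"))
print(s.dtype)  # datetime64[s] on pandas 2.x, datetime64[ns] on 1.x

# On pandas 1.x this astype is effectively a no-op, since "ms" is not
# a supported resolution there; the data stays "ns".
print(s.astype("datetime64[ms]").dtype)
```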
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
# Partitioned columns become 'categorical' dtypes
for col in partition_by:
output_df[col] = output_df[col].astype('category')
+
+ if schema:
+ expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+ else:
+ # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+ # silently coerce to [ns] due to non-[ns] support.
+ expected_date_type = 'datetime64[ms]'
Review Comment:
> Maybe it's simpler to just test with a DataFrame of nanoseconds, which now
> works the same with old and new pandas; then we don't have to add any
> comment or astype.
Hmm, trying that out locally fails (but only with the non-legacy code path),
and digging in, it seems that we are still writing Parquet v1 files with the
dataset API ...
Will open a separate issue and PR to fix that quickly.
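For reference, one way to verify the written format version (the file path here is hypothetical; `format_version` is exposed on the Parquet file metadata):
```python
import pyarrow.parquet as pq

# Inspect a file produced by write_to_dataset; "1.0" would confirm
# that the dataset API is still writing Parquet v1 files.
meta = pq.ParquetFile("base_path/part-0.parquet").metadata  # hypothetical path
print(meta.format_version)
```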