jorisvandenbossche commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1255367993
##########
python/pyarrow/array.pxi:
##########
@@ -721,12 +722,16 @@ cdef class _PandasConvertible(_Weakrefable):
integer_object_nulls : bool, default False
Cast integers with nulls to objects
date_as_object : bool, default True
- Cast dates to objects. If False, convert to datetime64[ns] dtype.
+ Cast dates to objects. If False, convert to datetime64 dtype with
+ the equivalent time unit (if supported). Note: in pandas version
+ < 2.0, only datetime64[ns] conversion is supported.
+ dtype.
Review Comment:
```suggestion
< 2.0, only datetime64[ns] conversion is supported.
```
(leftover from previous iteration)
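For context, a minimal sketch of the documented behavior (using pyarrow's `to_pandas` with its `date_as_object` keyword; the exact resulting unit depends on the pandas version):
```python
import datetime
import pyarrow as pa

arr = pa.array([datetime.date(2023, 1, 1)])  # date32 type

# Default (date_as_object=True): dates become Python date objects.
print(arr.to_pandas().dtype)  # object

# With date_as_object=False: a datetime64 dtype. On pandas >= 2.0 this
# keeps an equivalent non-nanosecond unit; pandas < 2.0 only supports
# datetime64[ns], so it coerces to that.
print(arr.to_pandas(date_as_object=False).dtype)
```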
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
# Partitioned columns become 'categorical' dtypes
for col in partition_by:
output_df[col] = output_df[col].astype('category')
+
+ if schema:
+ expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+ else:
+ # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+ # silently coerce to [ns] due to non-[ns] support.
+ expected_date_type = 'datetime64[ms]'
Review Comment:
This comment is not fully correct, I think: when converting the pandas
DataFrame to pyarrow, we actually don't get a date32 type but a timestamp
type. But then I also don't understand how this test is passing ...
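To illustrate the type point (a quick check, not part of the PR; the column name is just an example):
```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"date": pd.to_datetime(["2023-01-01", "2023-01-02"])})
table = pa.Table.from_pandas(df)

# The column comes through as a timestamp type (e.g. timestamp[ns]),
# not date32, so to_pandas_dtype() reflects a timestamp resolution.
print(table.schema.field("date").type)
```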
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
# Partitioned columns become 'categorical' dtypes
for col in partition_by:
output_df[col] = output_df[col].astype('category')
+
+ if schema:
+ expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+ else:
+ # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+ # silently coerce to [ns] due to non-[ns] support.
+ expected_date_type = 'datetime64[ms]'
Review Comment:
So what actually happens with pandas 2.x: when we create a DataFrame with
datetime64[D], that gets converted to datetime64[s] (the closest supported
resolution to "D"). Roundtripping through Parquet then turns that into "ms"
(because "s" is not supported by Parquet).
With older pandas, the data gets converted to datetime64[ns], comes back from
Parquet as "us", and is converted back to "ns" when converting to pandas. The
`astype("datetime64[ms]")` then essentially does nothing, i.e. pandas
preserves the "ns" because it doesn't support "ms", and hence the test also
passes for older pandas.
Maybe it's simpler to just test with a DataFrame of nanoseconds, which now
works the same with old and new pandas; then we don't have to add any comment
or astype.
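A minimal sketch of the coercion described above (behavior differs by pandas version, as noted):
```python
import numpy as np
import pandas as pd

# pandas 2.x stores a datetime64[D] array at the closest supported
# resolution, "s"; pandas 1.x coerces everything to "ns".
s = pd.Series(np.array(["2023-01-01"], dtype="datetime64[D]"))
print(s.dtype)  # datetime64[s] on pandas 2.x, datetime64[ns] on 1.x

# On pandas 1.x this astype is effectively a no-op, since "ms" is not
# a supported resolution there; the data stays "ns".
print(s.astype("datetime64[ms]").dtype)
```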
##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1313,6 +1312,15 @@ def _test_write_to_dataset_with_partitions(base_path,
# Partitioned columns become 'categorical' dtypes
for col in partition_by:
output_df[col] = output_df[col].astype('category')
+
+ if schema:
+ expected_date_type = schema.field_by_name('date').type.to_pandas_dtype()
+ else:
+ # Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
+ # silently coerce to [ns] due to non-[ns] support.
+ expected_date_type = 'datetime64[ms]'
Review Comment:
> Maybe it's simpler to just test with a DataFrame of nanoseconds, which now
> works the same with old and new pandas; then we don't have to add any
> comment or astype.
Hmm, trying that out locally fails (but only with the non-legacy code path),
and digging in, it seems that we are still writing Parquet v1 files with the
dataset API ...
Will open a separate issue and PR to fix that quickly.
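For reference, one way to verify the written format version (the file path here is hypothetical; `format_version` is exposed on the Parquet file metadata):
```python
import pyarrow.parquet as pq

# Inspect a file produced by write_to_dataset; "1.0" would confirm
# that the dataset API is still writing Parquet v1 files.
meta = pq.ParquetFile("base_path/part-0.parquet").metadata  # hypothetical path
print(meta.format_version)
```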