[GitHub] [arrow] danepitkin commented on a diff in pull request #35656: GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0

via GitHub Thu, 08 Jun 2023 08:13:58 -0700


danepitkin commented on code in PR #35656:
URL: https://github.com/apache/arrow/pull/35656#discussion_r1223194680



##########
python/pyarrow/types.pxi:
##########
@@ -40,10 +42,20 @@ cdef dict _pandas_type_map = {
     _Type_HALF_FLOAT: np.float16,
     _Type_FLOAT: np.float32,
     _Type_DOUBLE: np.float64,
-    _Type_DATE32: np.dtype('datetime64[ns]'),
-    _Type_DATE64: np.dtype('datetime64[ns]'),
-    _Type_TIMESTAMP: np.dtype('datetime64[ns]'),
-    _Type_DURATION: np.dtype('timedelta64[ns]'),
+    _Type_DATE32: np.dtype('datetime64[D]'),

Review Comment:
   ```
       
       @pytest.mark.filterwarnings("ignore:'ParquetDataset.schema:FutureWarning")
       def _test_write_to_dataset_with_partitions(base_path,
                                                  use_legacy_dataset=True,
                                                  filesystem=None,
                                                  schema=None,
                                                  index_name=None):
           import pandas as pd
           import pandas.testing as tm
   
           import pyarrow.parquet as pq
   
           # ARROW-1400
           output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                                     'group2': list('eefeffgeee'),
                                     'num': list(range(10)),
                                     'nan': [np.nan] * 10,
                                     'date': np.arange('2017-01-01', '2017-01-11',
                                                       dtype='datetime64[D]')})
           output_df["date"] = output_df["date"]
           cols = output_df.columns.tolist()
           partition_by = ['group1', 'group2']
           output_table = pa.Table.from_pandas(output_df, schema=schema, safe=False,
                                               preserve_index=False)
           pq.write_to_dataset(output_table, base_path, partition_by,
                               filesystem=filesystem,
                               use_legacy_dataset=use_legacy_dataset)
   
           metadata_path = os.path.join(str(base_path), '_common_metadata')
   
           if filesystem is not None:
               with filesystem.open(metadata_path, 'wb') as f:
                   pq.write_metadata(output_table.schema, f)
           else:
               pq.write_metadata(output_table.schema, metadata_path)
   
           # ARROW-2891: Ensure the output_schema is preserved when writing a
           # partitioned dataset
           dataset = pq.ParquetDataset(base_path,
                                       filesystem=filesystem,
                                       validate_schema=True,
                                       use_legacy_dataset=use_legacy_dataset)
           # ARROW-2209: Ensure the dataset schema also includes the partition columns
           if use_legacy_dataset:
               with pytest.warns(FutureWarning, match="'ParquetDataset.schema'"):
                   dataset_cols = set(dataset.schema.to_arrow_schema().names)
           else:
               # NB schema property is an arrow and not parquet schema
               dataset_cols = set(dataset.schema.names)
   
           assert dataset_cols == set(output_table.schema.names)
   
           input_table = dataset.read(use_pandas_metadata=True)
   
           input_df = input_table.to_pandas()
   
           # Read data back in and compare with original DataFrame
           # Partitioned columns added to the end of the DataFrame when read
           input_df_cols = input_df.columns.tolist()
           assert partition_by == input_df_cols[-1 * len(partition_by):]
   
           input_df = input_df[cols]
           # Partitioned columns become 'categorical' dtypes
           for col in partition_by:
               output_df[col] = output_df[col].astype('category')
           # if schema is None and Version(pd.__version__) >= Version("2.0.0"):
           #     output_df['date'] = output_df['date'].astype('datetime64[ms]')
   >       tm.assert_frame_equal(output_df, input_df)
   E       AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
   E
   E       Attribute "dtype" are different
   E       [left]:  datetime64[s]
   E       [right]: datetime64[ms]
   ```
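
   For reference, a minimal sketch of the mismatch and of the commented-out workaround, assuming pandas >= 2.0 and NumPy are installed. It does not go through Parquet: the `right` frame is a hypothetical stand-in for the round-tripped `input_df`, built directly at millisecond resolution to match the dtype the assertion reports.

   ```python
   import numpy as np
   import pandas as pd
   import pandas.testing as tm

   # Same date range as the test above: ten daily values.
   days = np.arange('2017-01-01', '2017-01-11', dtype='datetime64[D]')

   # pandas >= 2.0 has no day resolution, so this column is expected to be
   # stored with the closest supported unit (seconds), as in [left] above.
   left = pd.DataFrame({'date': days})

   # Stand-in for the round-tripped frame (assumption: not read from Parquet),
   # holding the same dates at millisecond resolution, as in [right] above.
   right = pd.DataFrame({'date': days.astype('datetime64[ms]')})

   # Align the unit of the expected frame before comparing; this mirrors the
   # commented-out astype('datetime64[ms]') line in the test.
   aligned = left.assign(date=left['date'].astype('datetime64[ms]'))
   tm.assert_frame_equal(aligned, right)
   ```

   Casting the expected frame only hides the difference in the test; whether the conversion itself should produce `datetime64[s]` or `datetime64[ms]` for this column is the open question.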


