[GitHub] [arrow] AlenkaF commented on a diff in pull request #34498: GH-34404: [Python] Failing tests because pandas.Index can now store all numeric dtypes (not only 64bit versions)

via GitHub Thu, 09 Mar 2023 07:49:19 -0800


AlenkaF commented on code in PR #34498:
URL: https://github.com/apache/arrow/pull/34498#discussion_r1131231811



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -735,8 +735,15 @@ def _partition_test_for_filesystem(fs, base_path, use_legacy_dataset=True):
                    .reset_index(drop=True)
                    .reindex(columns=result_df.columns))
 
-    expected_df['foo'] = pd.Categorical(df['foo'], categories=foo_keys)
-    expected_df['bar'] = pd.Categorical(df['bar'], categories=bar_keys)
+    if use_legacy_dataset or Version(pd.__version__) < Version("2.0.0"):
+        expected_df['foo'] = pd.Categorical(df['foo'], categories=foo_keys)
+        expected_df['bar'] = pd.Categorical(df['bar'], categories=bar_keys)
+    else:
+        # With pandas 2.0.0 Index can store all numeric dtypes (not just
+        # int64/uint64/float64). Using astype() to create a categorical
+        # column preserves original dtype (int32)
+        expected_df['foo'] = expected_df['foo'].astype("category")
+        expected_df['bar'] = expected_df['bar'].astype("category")

Review Comment:
   Unfortunately it doesn't: on older versions of pandas (and with the legacy 
dataset; I don't know why and didn't think it was worth investigating) the `foo` 
value type in `result_df` is `int64`, but `.astype("category")` would define the 
type of `foo` in `expected_df` as `int32`.
   
   It is just the opposite in newer versions of pandas: the `foo` value type 
in `result_df` is `int32`, but `pd.Categorical` defines the type of `foo` in 
`expected_df` as `int64`.
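   
   A minimal sketch of the two constructions, to make the mismatch concrete. 
The `foo` series here is a hypothetical stand-in for the `int32` column in the 
test, not the actual test data. Passing `categories=foo_keys` builds the 
categories from a Python list, so they end up as `int64`; `.astype("category")` 
derives the categories from the values themselves, which on pandas 2.0 and 
later keeps `int32`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the int32 partition column used in the test.
foo_keys = [0, 1]
foo = pd.Series(np.array([0, 1, 0, 1], dtype="int32"))

# Categories are built from the Python list, so they are stored as int64.
explicit = pd.Categorical(foo, categories=foo_keys)

# Categories are derived from the column's own values.
inferred = foo.astype("category")

explicit_dtype = explicit.categories.dtype
inferred_dtype = inferred.cat.categories.dtype

# The second dtype depends on the installed pandas version.
print(pd.__version__, explicit_dtype, inferred_dtype)
```

   Comparing the two dtypes printed at the end under each pandas version shows 
which construction matches `result_df` in that environment.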
   
   


