Re: [PR] GH-43683: [Python] Use pandas StringDtype when enabled (pandas 3+) [arrow]

via GitHub Wed, 13 Nov 2024 06:52:11 -0800


jorisvandenbossche commented on code in PR #44195:
URL: https://github.com/apache/arrow/pull/44195#discussion_r1840481733



##########
python/pyarrow/pandas_compat.py:
##########
@@ -842,12 +844,25 @@ def _get_extension_dtypes(table, columns_metadata, types_mapper=None):
     and then we can check if this dtype supports conversion from arrow.
 
     """
+    strings_to_categorical = options["strings_to_categorical"]
+    categories = categories or []
+
     ext_columns = {}
 
     # older pandas version that does not yet support extension dtypes
     if _pandas_api.extension_dtype is None:
         return ext_columns
 
+    # for pandas 3.0+, use pandas' new default string dtype
+    if _pandas_api.uses_string_dtype() and not strings_to_categorical:
+        for field in table.schema:
+            if (
+                pa.types.is_string(field.type)
+                or pa.types.is_large_string(field.type)
+                or pa.types.is_string_view(field.type)
+            ) and field.name not in categories:

Review Comment:
   If `field.name in categories` is true, the user asked for this column to be 
converted to a categorical dtype on the pandas side. The C++ side handles that 
request by dictionary-encoding the column, so we don't need to specify a custom 
pandas extension dtype here: our conversion layer then converts the 
dictionary-encoded column to a pandas categorical.
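
To illustrate the condition being discussed, here is a minimal pure-Python sketch of the selection logic. It is not the pyarrow implementation: `columns_for_string_dtype`, `STRING_TYPES`, and the `(name, type_name)` tuples are hypothetical stand-ins for the schema fields and the `pa.types.is_*` checks in the diff, and `uses_string_dtype` is passed in as a plain flag rather than queried from pandas.

```python
# Hypothetical stand-in for the check in the diff; not pyarrow's actual code.
# Arrow string-like types are represented by their names for illustration.
STRING_TYPES = {"string", "large_string", "string_view"}


def columns_for_string_dtype(fields, categories=None,
                             strings_to_categorical=False,
                             uses_string_dtype=True):
    """Return the names of columns that would get the pandas string dtype."""
    categories = categories or []
    # If the new string dtype is not in use, or every string column is being
    # converted to categorical, no column gets the string extension dtype.
    if not uses_string_dtype or strings_to_categorical:
        return []
    return [
        name
        for name, type_name in fields
        # Columns listed in `categories` are skipped: they are expected to
        # arrive dictionary-encoded and become pandas categoricals instead.
        if type_name in STRING_TYPES and name not in categories
    ]


fields = [("a", "string"), ("b", "large_string"), ("c", "int64")]
selected = columns_for_string_dtype(fields, categories=["b"])
print(selected)
```

In this sketch, column `b` is left out because it was requested as a category, and `c` is left out because it is not a string type.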



