[arrow] branch main updated: GH-38034: [Python] DataFrame Interchange Protocol - correct dtype information for categorical columns (#38065)

alenka Mon, 09 Oct 2023 21:05:36 -0700

This is an automated email from the ASF dual-hosted git repository.

alenka pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git



The following commit(s) were added to refs/heads/main by this push:
     new db420c9cb6 GH-38034: [Python] DataFrame Interchange Protocol - correct dtype information for categorical columns (#38065)
db420c9cb6 is described below

commit db420c9cb6d0d93ce13aec06c59a9ae2d4c775f6
Author: Alenka Frim <[email protected]>
AuthorDate: Tue Oct 10 06:05:24 2023 +0200

    GH-38034: [Python] DataFrame Interchange Protocol - correct dtype information for categorical columns (#38065)
    
    ### Rationale for this change
    See: https://github.com/apache/arrow/issues/38034#issue-1927839216
    
    ### What changes are included in this PR?
    
    The `f_string` for columns with a categorical dtype now reflects the type of the indices of the dictionary data type. The bit width was already correct. From the spec:
    
    > For categoricals, the format string describes the type of the
    > categorical in the data buffer. In case of a separate encoding of
    > the categorical (e.g. an integer to string mapping), this can
    > be derived from ``self.describe_categorical``.
    
    ### Are these changes tested?
    
    Yes.
    
    ### Are there any user-facing changes?
    
    No.
    * Closes: #38034
    
    Authored-by: AlenkaF <[email protected]>
    Signed-off-by: AlenkaF <[email protected]>
---
 python/pyarrow/interchange/column.py                 |  4 +++-
 .../tests/interchange/test_interchange_spec.py       | 20 ++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/python/pyarrow/interchange/column.py b/python/pyarrow/interchange/column.py
index a9b8958616..eaf7834d5b 100644
--- a/python/pyarrow/interchange/column.py
+++ b/python/pyarrow/interchange/column.py
@@ -312,7 +312,9 @@ class _PyArrowColumn:
             return kind, bit_width, f_string, Endianness.NATIVE
         elif pa.types.is_dictionary(dtype):
             kind = DtypeKind.CATEGORICAL
-            f_string = "L"
+            arr = self._col
+            indices_dtype = arr.indices.type
+            _, f_string = _PYARROW_KINDS.get(indices_dtype)
             return kind, bit_width, f_string, Endianness.NATIVE
         else:
             kind, f_string = _PYARROW_KINDS.get(dtype, (None, None))
diff --git a/python/pyarrow/tests/interchange/test_interchange_spec.py b/python/pyarrow/tests/interchange/test_interchange_spec.py
index 7b2b8eb720..826089652b 100644
--- a/python/pyarrow/tests/interchange/test_interchange_spec.py
+++ b/python/pyarrow/tests/interchange/test_interchange_spec.py
@@ -266,3 +266,23 @@ def test_buffer(int, use_batch):
         for idx, truth in enumerate(arr):
             val = ctype.from_address(dataBuf.ptr + idx * (bitwidth // 8)).value
             assert val == truth, f"Buffer at index {idx} mismatch"
+
+
+@pytest.mark.parametrize(
+    "indices_type, bitwidth, f_string", [
+        (pa.int8(), 8, "c"),
+        (pa.int16(), 16, "s"),
+        (pa.int32(), 32, "i"),
+        (pa.int64(), 64, "l")
+    ]
+)
+def test_categorical_dtype(indices_type, bitwidth, f_string):
+    type = pa.dictionary(indices_type, pa.string())
+    arr = pa.array(["a", "b", None, "d"], type)
+    table = pa.table({'a': arr})
+
+    df = table.__dataframe__()
+    col = df.get_column(0)
+    assert col.dtype[0] == 23  # <DtypeKind.CATEGORICAL: 23>
+    assert col.dtype[1] == bitwidth
+    assert col.dtype[2] == f_string
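
The dtype tuple that the new test checks can be illustrated without pyarrow. The following is only a minimal sketch, not the pyarrow implementation: `categorical_dtype` and `INDEX_FORMAT_BY_BITWIDTH` are hypothetical names introduced here, the width-to-format pairs are the ones used in the parametrized test, and `"="` is assumed to be the value of `Endianness.NATIVE` as defined in the interchange spec.

```python
# Hypothetical, pyarrow-free sketch; every name below is illustrative only.
CATEGORICAL = 23  # DtypeKind.CATEGORICAL, the value asserted in the test
NATIVE = "="      # assumed value of Endianness.NATIVE in the interchange spec

# Format strings for signed integer index types, keyed by bit width.
# These are the same pairs as in the parametrized test.
INDEX_FORMAT_BY_BITWIDTH = {8: "c", 16: "s", 32: "i", 64: "l"}


def categorical_dtype(index_bitwidth):
    """Return (kind, bit_width, format_string, endianness) for a categorical
    column whose dictionary indices are signed integers of the given width."""
    try:
        f_string = INDEX_FORMAT_BY_BITWIDTH[index_bitwidth]
    except KeyError:
        # Fail loudly instead of falling back to a fixed format string.
        raise ValueError(
            f"unsupported index bit width: {index_bitwidth}"
        ) from None
    return CATEGORICAL, index_bitwidth, f_string, NATIVE


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print(width, categorical_dtype(width))
```

The point of the sketch is that the format string follows the index type rather than being a fixed value, which is the behavior this commit introduces for dictionary-typed columns.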
