This is an automated email from the ASF dual-hosted git repository.
alenka pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new db420c9cb6 GH-38034: [Python] DataFrame Interchange Protocol - correct
dtype information for categorical columns (#38065)
db420c9cb6 is described below
commit db420c9cb6d0d93ce13aec06c59a9ae2d4c775f6
Author: Alenka Frim <[email protected]>
AuthorDate: Tue Oct 10 06:05:24 2023 +0200
GH-38034: [Python] DataFrame Interchange Protocol - correct dtype
information for categorical columns (#38065)
### Rationale for this change
See: https://github.com/apache/arrow/issues/38034#issue-1927839216
### What changes are included in this PR?
The `f_string` for columns with a categorical dtype now reflects the type of
the indices of the dictionary data type, instead of always being `"L"`. The
bit width was already correct. From the spec:
> For categoricals, the format string describes the type of the
> categorical in the data buffer. In case of a separate encoding of
> the categorical (e.g. an integer to string mapping), this can
> be derived from ``self.describe_categorical``.
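As a rough illustration of what the format string should report for each index type, here is a minimal sketch. The format characters are the integer codes from the Arrow C data interface; the table `INTEGER_FORMAT_STRINGS` and the helper `categorical_format_string` are hypothetical names made up for this example and are not part of pyarrow, which performs the lookup through `_PYARROW_KINDS`.

```python
# Format characters for integer types as defined by the Arrow C data interface.
# Keys are (is_signed, bit_width). This table and the helper below are
# hypothetical stand-ins for illustration, not pyarrow internals.
INTEGER_FORMAT_STRINGS = {
    (True, 8): "c", (False, 8): "C",
    (True, 16): "s", (False, 16): "S",
    (True, 32): "i", (False, 32): "I",
    (True, 64): "l", (False, 64): "L",
}


def categorical_format_string(bit_width, is_signed=True):
    """Return the format string for the index type of a categorical column."""
    try:
        return INTEGER_FORMAT_STRINGS[(is_signed, bit_width)]
    except KeyError:
        raise ValueError(
            f"unsupported index type: signed={is_signed}, bits={bit_width}"
        ) from None


# Before this change every categorical column reported "L" (uint64),
# whatever its index type was.
for bits in (8, 16, 32, 64):
    print(bits, categorical_format_string(bits))
```

The four signed cases printed by the loop correspond to the index types exercised by the new test in this commit.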
### Are these changes tested?
Yes.
### Are there any user-facing changes?
No.
* Closes: #38034
Authored-by: AlenkaF <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
---
python/pyarrow/interchange/column.py | 4 +++-
.../tests/interchange/test_interchange_spec.py | 20 ++++++++++++++++++++
2 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/python/pyarrow/interchange/column.py b/python/pyarrow/interchange/column.py
index a9b8958616..eaf7834d5b 100644
--- a/python/pyarrow/interchange/column.py
+++ b/python/pyarrow/interchange/column.py
@@ -312,7 +312,9 @@ class _PyArrowColumn:
             return kind, bit_width, f_string, Endianness.NATIVE
         elif pa.types.is_dictionary(dtype):
             kind = DtypeKind.CATEGORICAL
-            f_string = "L"
+            arr = self._col
+            indices_dtype = arr.indices.type
+            _, f_string = _PYARROW_KINDS.get(indices_dtype)
             return kind, bit_width, f_string, Endianness.NATIVE
         else:
             kind, f_string = _PYARROW_KINDS.get(dtype, (None, None))
diff --git a/python/pyarrow/tests/interchange/test_interchange_spec.py b/python/pyarrow/tests/interchange/test_interchange_spec.py
index 7b2b8eb720..826089652b 100644
--- a/python/pyarrow/tests/interchange/test_interchange_spec.py
+++ b/python/pyarrow/tests/interchange/test_interchange_spec.py
@@ -266,3 +266,23 @@ def test_buffer(int, use_batch):
     for idx, truth in enumerate(arr):
         val = ctype.from_address(dataBuf.ptr + idx * (bitwidth // 8)).value
         assert val == truth, f"Buffer at index {idx} mismatch"
+
+
+@pytest.mark.parametrize(
+    "indices_type, bitwidth, f_string", [
+        (pa.int8(), 8, "c"),
+        (pa.int16(), 16, "s"),
+        (pa.int32(), 32, "i"),
+        (pa.int64(), 64, "l")
+    ]
+)
+def test_categorical_dtype(indices_type, bitwidth, f_string):
+    type = pa.dictionary(indices_type, pa.string())
+    arr = pa.array(["a", "b", None, "d"], type)
+    table = pa.table({'a': arr})
+
+    df = table.__dataframe__()
+    col = df.get_column(0)
+    assert col.dtype[0] == 23  # <DtypeKind.CATEGORICAL: 23>
+    assert col.dtype[1] == bitwidth
+    assert col.dtype[2] == f_string