Joris Van den Bossche created ARROW-7591:
--------------------------------------------
Summary: [Python] DictionaryArray.to_numpy returns dict of parts
instead of numpy array
Key: ARROW-7591
URL: https://issues.apache.org/jira/browse/ARROW-7591
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
Currently, the {{to_numpy}} method doesn't return an ndarray in the case of
dictionary-typed data:
{code}
In [54]: a = pa.array(pd.Categorical(["a", "b", "a"]))
In [55]: a
Out[55]:
<pyarrow.lib.DictionaryArray object at 0x7f5c63d98f28>
-- dictionary:
[
"a",
"b"
]
-- indices:
[
0,
1,
0
]
In [57]: a.to_numpy(zero_copy_only=False)
Out[57]:
{'indices': array([0, 1, 0], dtype=int8),
'dictionary': array(['a', 'b'], dtype=object),
'ordered': False}
{code}
This is actually just an internal representation that is passed from C++ to
Python so on the Python side a {{pd.Categorical}} / {{CategoricalBlock}} can be
constructed, but it's not something we should return as such to the user.
Rather, I think we should return a decoded / dense numpy array (or at least
raise an error instead of returning this dict).
(Also, if the user wants those parts, they are already available from the
dictionary array as {{a.indices}}, {{a.dictionary}} and {{a.type.ordered}}.)
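
To illustrate what a "decoded / dense" result would look like, here is a minimal numpy-only sketch that expands the parts shown in the {{Out[57]}} dict above. The helper name {{decode_dictionary}} is hypothetical and not part of pyarrow, and the sketch assumes there are no null indices:

```python
import numpy as np


def decode_dictionary(indices, dictionary):
    """Hypothetical helper: expand dictionary-encoded parts into a dense array.

    Assumes every index is valid (no nulls / missing values).
    """
    indices = np.asarray(indices)
    dictionary = np.asarray(dictionary, dtype=object)
    # Integer-array indexing looks up each index in the dictionary values.
    return dictionary[indices]


# Parts copied from the dict currently returned by to_numpy (Out[57] above).
parts = {
    "indices": np.array([0, 1, 0], dtype=np.int8),
    "dictionary": np.array(["a", "b"], dtype=object),
    "ordered": False,
}

dense = decode_dictionary(parts["indices"], parts["dictionary"])
print(dense)  # ['a' 'b' 'a']
```

The result is a plain object-dtype ndarray with one entry per element of the original array, which is presumably what a user calling {{to_numpy}} expects to get back. Handling of nulls is left out of this sketch and would need a decision of its own.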