[
https://issues.apache.org/jira/browse/ARROW-9594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325114#comment-17325114
]
Joris Van den Bossche commented on ARROW-9594:
----------------------------------------------
So for pandas that's indeed correct, because the {{pd.Categorical}} storage
model uses -1 to represent missing values in the integer codes (pandas stores
the indices/codes of a categorical in a NumPy integer array, and NumPy integer
arrays do not support missing values).
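
To make the storage model concrete, here is a minimal sketch (not taken from the Arrow code base) that builds a categorical directly from integer codes. The variable names are only illustrative; {{pd.Categorical.from_codes}}, {{.codes}} and {{.isna()}} are the regular pandas API.

```python
import pandas as pd

# Build a categorical directly from integer codes; -1 marks a missing value.
cat = pd.Categorical.from_codes([0, 1, -1, 0], categories=["foo", "bar"])

codes = cat.codes.tolist()      # the underlying integer codes, including -1
missing = cat.isna().tolist()   # positions that pandas treats as missing

print(codes)    # [0, 1, -1, 0]
print(missing)  # [False, False, True, False]
```

The -1 entry is a sentinel, not a position in the categories, so any code that uses the raw codes as indices has to mask it out first.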
> [Python] DictionaryArray.to_numpy does not correctly convert null indexes to
> null values
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-9594
> URL: https://issues.apache.org/jira/browse/ARROW-9594
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0
> Reporter: Steve M. Kim
> Priority: Major
> Fix For: 5.0.0
>
>
> Example
> {code:python}
> >>> import pyarrow as pa
> >>> a = pa.DictionaryArray.from_arrays(pa.array([0, 1, None, 0],
> ...     type=pa.int32()), pa.array(['foo', 'bar']))
> >>> a
> <pyarrow.lib.DictionaryArray object at 0x7f12fc94ccf0>
> -- dictionary:
> [
> "foo",
> "bar"
> ]
> -- indices:
> [
> 0,
> 1,
> null,
> 0
> ]
> >>> a.to_pandas() # this works
> 0 foo
> 1 bar
> 2 NaN
> 3 foo
> dtype: category
> Categories (2, object): [foo, bar]
> >>> a.to_numpy(zero_copy_only=False) # this is broken
> array(['foo', 'bar', 'bar', 'foo'], dtype=object)
> {code}
>