[
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974399#comment-16974399
]
Thomas Buhrmann commented on ARROW-7168:
----------------------------------------
OK, I think I found a workaround for converting an all-NaN categorical
pd.Series to a dictionary array:
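{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

# Should be astype('string'), but pandas doesn't preserve NaNs
ser = pd.Series([np.nan, np.nan]).astype('object').astype('category')
arr = pa.DictionaryArray.from_arrays(
    # All codes are -1, the pandas code for a missing value
    indices=-np.ones(len(ser), dtype=ser.cat.codes.dtype),
    # Empty dictionary whose dtype fixes the value type to string
    dictionary=np.array([], dtype='str'),
    # Mask every entry so that all values are null
    mask=np.ones(len(ser), dtype='bool'),
    ordered=ser.cat.ordered)
print(arr.type)
pd.Series(arr.to_pandas())
{code}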
which produces:
{noformat}
dictionary<values=string, indices=int8, ordered=0>
0 NaN
1 NaN
dtype: category
Categories (0, object): []
{noformat}
i.e. the 'str' value_type is now respected and the roundtrip produces the
correct result.
> pa.array() doesn't respect provided dictionary type with all NaNs
> -----------------------------------------------------------------
>
> Key: ARROW-7168
> URL: https://issues.apache.org/jira/browse/ARROW-7168
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.15.1
> Reporter: Thomas Buhrmann
> Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns.
> When creating a dictionary array, even when fully specifying the desired
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> # This may look a little artificial, but it easily occurs when processing
> # categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(),
>                     ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary<values=null, indices=int8, ordered=0>)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the
> possibility of all-NaN batches exists, even when trying to enforce that each
> batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd
> imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array
> of the desired type containing only NaNs?