[jira] [Commented] (ARROW-7168) [Python] pa.array() doesn't respect specified dictionary type

Thomas Buhrmann (Jira) Fri, 15 Nov 2019 01:42:46 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974964#comment-16974964
 ]


Thomas Buhrmann commented on ARROW-7168:
----------------------------------------

Yes, that's right. I didn't notice it silently 'failing' in other cases because 
I usually construct the type explicitly to match.

I guess it should be a relatively easy fix: as I show above, one can construct 
an all-NaN DictionaryArray using from_arrays() with negative indices, a 
np.array of the desired type as the dictionary, and a mask that marks every 
entry as null. I haven't checked under the hood why using -1 as indices works 
without setting from_pandas=True, so I'm not sure this is the best way to 
create the array, but it seems to work in practice.

> [Python] pa.array() doesn't respect specified dictionary type
> -------------------------------------------------------------
>
>                 Key: ARROW-7168
>                 URL: https://issues.apache.org/jira/browse/ARROW-7168
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.1
>            Reporter: Thomas Buhrmann
>            Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired 
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> # This may look a little artificial, but it easily occurs when processing
> # categorical data in batches and a particular batch contains only NaNs.
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(),
>                     ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary<values=null, indices=int8, ordered=0>)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the 
> possibility of all-NaN batches exists, even when trying to enforce that each 
> batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd 
> imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array 
> of the desired type containing only NaNs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
