[jira] [Commented] (ARROW-7168) pa.array() doesn't respect provided dictionary type with all NaNs

Thomas Buhrmann (Jira) Thu, 14 Nov 2019 08:52:46 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974427#comment-16974427
 ]


Thomas Buhrmann commented on ARROW-7168:
----------------------------------------

While I'm at it, and in case somebody faces the same problem: to safely 
convert pandas categoricals to Arrow while keeping the type constant across 
batches, something like the following should work:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa


def categorical_to_arrow(ser, known_categories=None, ordered=None):
    """Safely create a pa.DictionaryArray from a categorical pd.Series.

    Args:
        ser (pd.Series): should be of CategoricalDtype
        known_categories (np.ndarray): force known categories. If None, and
            the Series doesn't have any values to infer them from, will use
            an empty array of the same dtype as the categories attribute
            of the Series
        ordered (bool): whether categories should be ordered. If None,
            the ordered attribute of the Series is used
    """
    n = len(ser)
    all_nan = ser.isna().sum() == n
       
    # Enforce provided categories, use the original ones, or enforce
    # the correct value_type if Arrow would otherwise change it to 'null'
    if isinstance(known_categories, np.ndarray):
        dictionary = known_categories
    elif all_nan:
        # value_type may be known, but Arrow doesn't understand 'object' dtype
        value_type = ser.cat.categories.dtype
        if value_type == 'object':
            value_type = 'str'
        dictionary = np.array([], dtype=value_type)
    else:
        dictionary = ser.cat.categories
        
    # Allow overwriting of ordered attribute
    if ordered is None:
        ordered = ser.cat.ordered

    if all_nan:
        return pa.DictionaryArray.from_arrays(
            indices=-np.ones(n, dtype=ser.cat.codes.dtype),
            dictionary=dictionary,
            mask=np.ones(n, dtype='bool'),
            ordered=ordered)
    else:
        return pa.DictionaryArray.from_arrays(
            indices=ser.cat.codes,
            dictionary=dictionary,
            ordered=ordered,
            from_pandas=True
        )
{code}
This seems to be the only way (?) to control the resulting dictionary type. 
For example:
{code:python}
# String categories with and without non-NaN values
sers = [
    pd.Series([None, None]).astype('object').astype('category'),
    pd.Series(['a', None, None]).astype('category')
]

# The categorical types we may want
known_categories = [
    None,
    np.array(['a', 'b', 'c'], dtype='str'),
    np.array([1, 2, 3], dtype='int8')
]

# Convert each series with each of the desired category types
for ser in sers:
    for cats in known_categories:
        arr = categorical_to_arrow(ser, known_categories=cats)
        ser2 = pd.Series(arr.to_pandas())
        print(f"Series: {list(ser)} | Known categories: {cats}")
        print(f"Dictionary type: {arr.type}")
        print(f"Roundtripped Series: \n{ser2}", "\n")
{code}
which produces:
{noformat}
Series: [nan, nan] | Known categories: None
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0    NaN
1    NaN
dtype: category
Categories (0, object): [] 

Series: [nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0    NaN
1    NaN
dtype: category
Categories (3, object): [a, b, c] 

Series: [nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary<values=int8, indices=int8, ordered=0>
Roundtripped Series: 
0    NaN
1    NaN
dtype: category
Categories (3, int64): [1, 2, 3] 

Series: ['a', nan, nan] | Known categories: None
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0      a
1    NaN
2    NaN
dtype: category
Categories (1, object): [a] 

Series: ['a', nan, nan] | Known categories: ['a' 'b' 'c']
Dictionary type: dictionary<values=string, indices=int8, ordered=0>
Roundtripped Series: 
0      a
1    NaN
2    NaN
dtype: category
Categories (3, object): [a, b, c] 

Series: ['a', nan, nan] | Known categories: [1 2 3]
Dictionary type: dictionary<values=int8, indices=int8, ordered=0>
Roundtripped Series: 
0      1
1    NaN
2    NaN
dtype: category
Categories (3, int64): [1, 2, 3] 
{noformat}
(The last example effectively recodes the categories, since 'a' comes back 
as 1, but that would be a usage error rather than a conversion problem.)

> pa.array() doesn't respect provided dictionary type with all NaNs
> -----------------------------------------------------------------
>
>                 Key: ARROW-7168
>                 URL: https://issues.apache.org/jira/browse/ARROW-7168
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.1
>            Reporter: Thomas Buhrmann
>            Priority: Major
>
> This might be related to ARROW-6548 and others dealing with all NaN columns. 
> When creating a dictionary array, even when fully specifying the desired 
> type, this type is not respected when the data contains only NaNs:
> {code:python}
> # This may look a little artificial, but it easily occurs when processing 
> # categorical data in batches and a particular batch contains only NaNs
> ser = pd.Series([None, None]).astype('object').astype('category')
> typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), 
> ordered=False)
> pa.array(ser, type=typ).type
> {code}
> results in
> {noformat}
> >> DictionaryType(dictionary<values=null, indices=int8, ordered=0>)
> {noformat}
> which means that one cannot e.g. serialize batches of categoricals if the 
> possibility of all-NaN batches exists, even when trying to enforce that each 
> batch has the same schema (because the schema is not respected).
> I understand that inferring the type in this case would be difficult, but I'd 
> imagine that a fully specified type should be respected in this case?
> In the meantime, is there a workaround to manually create a dictionary array 
> of the desired type containing only NaNs?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
