[ https://issues.apache.org/jira/browse/ARROW-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2515:
----------------------------------
    Labels: pull-request-available  (was: )

> Errors with DictionaryArray inside of ListArray or other DictionaryArray
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2515
>                 URL: https://issues.apache.org/jira/browse/ARROW-2515
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Brent Kerby
>            Priority: Major
>              Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> An exception ("KeyError: 26") is raised when .as_py() is called on elements 
> of a ListArray over a DictionaryArray, or of a DictionaryArray whose 
> dictionary is itself a DictionaryArray. Here are two tests that currently fail:
>  
> {code:python}
> import pyarrow as pa
>
>
> def test_dictionary_array_1():
>     # ListArray whose values are dictionary-encoded
>     dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
>     list_arr = pa.ListArray.from_arrays([0, 2, 3], dict_arr)
>     assert list_arr.to_pylist() == [['a', 'b'], ['a']]
>
>
> def test_dictionary_array_2():
>     # DictionaryArray whose dictionary is itself a DictionaryArray
>     dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
>     dict_arr2 = pa.DictionaryArray.from_arrays([0, 1, 2, 1, 0], dict_arr)
>     assert dict_arr2.to_pylist() == ['a', 'b', 'a', 'b', 'a']
> {code}
> The problem appears to be that the function box_scalar in scalar.pxi does 
> not handle dictionary arrays, because we currently have no DictionaryValue 
> type. 
> DictionaryArray.__getitem__ currently works around the lack of 
> DictionaryValue type by dereferencing the index and constructing a scalar 
> based on the value in the underlying dictionary. In other words, if we have a 
> dictionary with int8 indices and string values, then the result of 
> __getitem__ will be a StringValue (rather than a DictionaryValue). This works 
> in simple cases but not in the more complex scenarios illustrated above.
> I have a patch ready, which would add a DictionaryValue type similar to other 
> scalar types, resolving these bugs and removing the need for a special-cased 
> implementation of DictionaryArray.__getitem__. This DictionaryValue would 
> expose two accessor properties, "indices_value" and "dictionary_value", 
> giving access to both the index into the dictionary and the looked-up 
> value. Then DictionaryValue.as_py() would simply call .as_py() on the 
> underlying dictionary_value. 
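
To make the proposal above concrete, here is a minimal pure-Python sketch of the idea. It is not the actual patch and does not use pyarrow: ToyDictionaryArray and list_over_dictionary are hypothetical stand-ins invented for illustration, and only the names DictionaryValue, indices_value, dictionary_value and as_py come from the description. It assumes that a dictionary-encoded array is just a sequence of indices plus a dictionary that may itself be dictionary-encoded.

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass
class ToyDictionaryArray:
    """Hypothetical stand-in for a dictionary-encoded array (not pyarrow)."""
    indices: Sequence[int]
    dictionary: Any  # a plain list, or another ToyDictionaryArray

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Always return a DictionaryValue; no special-cased dereferencing here.
        return DictionaryValue(self, i)

    def to_pylist(self):
        return [self[i].as_py() for i in range(len(self))]


@dataclass
class DictionaryValue:
    """Scalar that remembers both the index and the looked-up value."""
    parent: ToyDictionaryArray
    position: int

    @property
    def indices_value(self):
        # The index stored at this position.
        return self.parent.indices[self.position]

    @property
    def dictionary_value(self):
        # The dictionary entry that the index points at.
        return self.parent.dictionary[self.indices_value]

    def as_py(self):
        # Delegate to the looked-up value, which may itself be a DictionaryValue.
        value = self.dictionary_value
        return value.as_py() if isinstance(value, DictionaryValue) else value


def list_over_dictionary(offsets, values):
    """Hypothetical helper: decode list slices over a dictionary-encoded array."""
    return [
        [values[j].as_py() for j in range(start, stop)]
        for start, stop in zip(offsets, offsets[1:])
    ]


dict_arr = ToyDictionaryArray([0, 1, 0], ['a', 'b'])
nested = ToyDictionaryArray([0, 1, 2, 1, 0], dict_arr)
```

In this toy model, nested.to_pylist() and list_over_dictionary([0, 2, 3], dict_arr) mirror the two failing tests: because every element is boxed as a DictionaryValue, the nested cases resolve through as_py() without any special handling in __getitem__.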



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
