[ https://issues.apache.org/jira/browse/ARROW-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou resolved ARROW-2515.
-----------------------------------
Resolution: Fixed
Fix Version/s: 0.10.0
Issue resolved by pull request 1954
[https://github.com/apache/arrow/pull/1954]
> Errors with DictionaryArray inside of ListArray or other DictionaryArray
> ------------------------------------------------------------------------
>
> Key: ARROW-2515
> URL: https://issues.apache.org/jira/browse/ARROW-2515
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Brent Kerby
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
> Original Estimate: 1h
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> An exception ("KeyError: 26") is raised when .as_py() is called on elements
> of a ListArray over a DictionaryArray, or of a DictionaryArray with values in
> a DictionaryArray. Here are a couple of tests that currently fail:
>
> {code:java}
> import pyarrow as pa
>
> def test_dictionary_array_1():
>     dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
>     list_arr = pa.ListArray.from_arrays([0, 2, 3], dict_arr)
>     assert list_arr.to_pylist() == [['a', 'b'], ['a']]
>
> def test_dictionary_array_2():
>     dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
>     dict_arr2 = pa.DictionaryArray.from_arrays([0, 1, 2, 1, 0], dict_arr)
>     assert dict_arr2.to_pylist() == ['a', 'b', 'a', 'b', 'a']
> {code}
> It appears that the problem is caused by the fact that the function
> box_scalar in scalar.pxi does not handle dictionary arrays, as we
> currently have no DictionaryValue type.
> DictionaryArray.__getitem__ currently works around the lack of a
> DictionaryValue type by dereferencing the index and constructing a scalar
> based on the value in the underlying dictionary. In other words, if we have a
> dictionary with int8 indices and string values, then the result of
> __getitem__ will be a StringValue (rather than a DictionaryValue). This works
> in simple cases but not in the more complex scenarios illustrated above.
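The workaround described above can be modelled in plain Python, without pyarrow. This is only an illustrative sketch: the function name getitem_dereferenced and the use of Python lists for the indices and the dictionary are assumptions made for the example, not the actual Cython code in pyarrow.

```python
def getitem_dereferenced(indices, dictionary, i):
    """Model of the special-cased lookup: skip the index, return the entry.

    `indices` and `dictionary` are plain lists standing in for the two
    halves of a dictionary-encoded array (hypothetical representation).
    """
    index = indices[i]
    if index is None:
        # A null slot has no dictionary entry to look up.
        return None
    return dictionary[index]


indices = [0, 1, 0]
dictionary = ['a', 'b']

# Works when the caller goes through this dedicated lookup...
values = [getitem_dereferenced(indices, dictionary, i)
          for i in range(len(indices))]
```

The lookup only helps callers that go through it. When the dictionary-encoded data is the child of another array, elements are presumably produced by a generic boxing path instead, which is where the missing scalar type would surface.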
> I have a patch ready, which would add a DictionaryValue type similar to other
> scalar types, resolving these bugs and removing the need for a special-cased
> implementation of DictionaryArray.__getitem__. This DictionaryValue would
> contain two accessor properties, "indices_value" and "dictionary_value",
> to allow access to both the index in the dictionary and the looked-up
> value. Then DictionaryValue.as_py() would simply call .as_py() on the
> underlying dictionary_value.
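A rough pure-Python model of the proposed scalar type may make the design easier to follow. The property names indices_value and dictionary_value and the delegating as_py() come from the description above; everything else (the PlainValue helper, the constructor arguments, the list-based storage) is assumed for illustration and does not reflect the actual patch.

```python
class PlainValue:
    """Hypothetical stand-in for an ordinary scalar such as StringValue."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        return self._value


class DictionaryValue:
    """Sketch of a scalar that keeps both the index and the looked-up value."""

    def __init__(self, indices, dictionary, position):
        # Assumed storage: plain lists plus the element position.
        self._indices = indices
        self._dictionary = dictionary
        self._position = position

    @property
    def indices_value(self):
        # The raw index stored at this position, boxed as a scalar.
        return PlainValue(self._indices[self._position])

    @property
    def dictionary_value(self):
        index = self.indices_value.as_py()
        if index is None:
            return PlainValue(None)
        entry = self._dictionary[index]
        # A nested dictionary yields scalars already; box anything else.
        return entry if hasattr(entry, 'as_py') else PlainValue(entry)

    def as_py(self):
        # Delegate to the looked-up value, as the description proposes.
        return self.dictionary_value.as_py()


# Inner dictionary-encoded data: indices [0, 1, 0] over ['a', 'b'].
inner = [DictionaryValue([0, 1, 0], ['a', 'b'], i) for i in range(3)]

# Outer dictionary whose entries are the inner scalars (second failing test).
outer = [DictionaryValue([0, 1, 2, 1, 0], inner, i).as_py() for i in range(5)]
```

Because as_py() delegates to whatever scalar the dictionary holds, the nested case resolves recursively without any special handling in the outer type.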