[
https://issues.apache.org/jira/browse/ARROW-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145984#comment-17145984
]
Wes McKinney commented on ARROW-9132:
-------------------------------------
This is a good deal more complicated than it seems, because the kernel has to
support dictionaries that vary across chunked inputs. If the dictionary is
constant across all inputs to the hash kernel, then you can hash the indices
and attach the single shared dictionary to the output. So if you had indices
{{[3, 0, 1, 3, 0, 1]}}, you would output {{[3, 0, 1]}} with whatever dictionary
was used.
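
A minimal sketch of the constant-dictionary case, in plain Python rather than the actual C++ hash kernel. The helper name {{unique_indices}} and the dictionary values are hypothetical, and nulls are ignored for simplicity:

```python
def unique_indices(indices):
    """Return the distinct indices in order of first appearance."""
    seen = set()
    out = []
    for i in indices:
        if i not in seen:
            seen.add(i)
            out.append(i)
    return out


# Hypothetical dictionary shared by every input chunk.
dictionary = ["a", "b", "c", "d"]
indices = [3, 0, 1, 3, 0, 1]

# Only the indices are hashed; the dictionary is attached unchanged.
unique_idx = unique_indices(indices)             # [3, 0, 1]
decoded = [dictionary[i] for i in unique_idx]    # ["d", "a", "b"]
```

The dictionary values are never hashed here, which is why this case is cheap.
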
If the dictionary varies, things are more complex, because you have to either
unify the dictionaries or convert the input to a non-dictionary type and then
hash that.
Either way, I personally won't be able to do this for 1.0.0.
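
For the varying-dictionary case, dictionary unification might look roughly like the sketch below. It is plain Python, not Arrow's implementation: the function names and the representation of a chunk as an {{(indices, dictionary)}} pair are assumptions made for illustration, and nulls are ignored.

```python
def unify_dictionaries(chunks):
    """Merge per-chunk dictionaries into one and remap each chunk's indices.

    `chunks` is a list of (indices, dictionary) pairs (hypothetical layout).
    Returns (unified_dictionary, remapped_index_lists).
    """
    unified = []
    position = {}          # value -> index in the unified dictionary
    remapped_chunks = []
    for indices, dictionary in chunks:
        # Transposition map: old index -> index in the unified dictionary.
        transpose = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            transpose.append(position[value])
        remapped_chunks.append([transpose[i] for i in indices])
    return unified, remapped_chunks


def unique_across_chunks(chunks):
    """Unique over all chunks: unify first, then hash the remapped indices."""
    unified, remapped_chunks = unify_dictionaries(chunks)
    seen = set()
    unique_idx = []
    for chunk in remapped_chunks:
        for i in chunk:
            if i not in seen:
                seen.add(i)
                unique_idx.append(i)
    return unique_idx, unified


# Two chunks whose dictionaries differ.
chunks = [
    ([0, 1, 0], ["a", "b"]),
    ([1, 0, 1], ["b", "c"]),
]
unique_idx, unified = unique_across_chunks(chunks)
# unified is ["a", "b", "c"] and unique_idx is [0, 1, 2]
```

The alternative mentioned above, converting to a non-dictionary type, would decode every index to its value and hash the values directly. That avoids the remapping step but hashes the values themselves once per row.
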
> [C++] Support unique kernel for dictionary type
> -----------------------------------------------
>
> Key: ARROW-9132
> URL: https://issues.apache.org/jira/browse/ARROW-9132
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Affects Versions: 0.17.1
> Reporter: Dave Hirschfeld
> Assignee: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
>
> Enabling
> [`strings_as_dictionary`](https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html?highlight=strings_as_dictionary#obtaining-apache-arrow-result-sets)
> in `turbodbc` returns a `ChunkedArray` of `dictionary` type (IIUC).
> I'd like to enable this for better performance; however, it seems that not
> all functionality is implemented for `dictionary` types. In particular,
> `unique` seems not to be implemented:
> {code}
> In [40]: nmi.__class__.mro()
> Out[40]: [pyarrow.lib.ChunkedArray, pyarrow.lib._PandasConvertible, object]
> In [41]: nmi.type
> Out[41]: DictionaryType(dictionary<values=string, indices=int32, ordered=0>)
> In [42]: nmi.unique()
> Traceback (most recent call last):
> File "<ipython-input-42-0fcb7893d5c4>", line 1, in <module>
> nmi.unique()
> File "pyarrow\table.pxi", line 307, in pyarrow.lib.ChunkedArray.unique
> File "pyarrow\error.pxi", line 106, in pyarrow.lib.check_status
> ArrowNotImplementedError: unique not implemented for
> dictionary<values=string, indices=int32, ordered=0>
> {code}
> It would be very useful if the `dictionary` type supported all the usual
> operations.