[ 
https://issues.apache.org/jira/browse/ARROW-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17145984#comment-17145984
 ] 

Wes McKinney commented on ARROW-9132:
-------------------------------------

This is a good deal more complicated than it seems, because the kernel has to
support dictionaries that vary across the chunks of a chunked input. If the
dictionary is the same across all inputs to the hash kernel, then you can hash
the indices and attach that single dictionary to the output. So if you had
indices {{[3, 0, 1, 3, 0, 1]}}, you would output {{[3, 0, 1]}} with whatever
dictionary was used.
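
To make the constant-dictionary case concrete, here is a minimal pure-Python sketch. It is not the Arrow C++ hash kernel. The function name {{unique_dictionary_encoded}} and the sample dictionary are hypothetical, plain lists stand in for Arrow arrays, and nulls are ignored.

```python
def unique_dictionary_encoded(indices, dictionary):
    """Deduplicate the indices and pass the dictionary through unchanged.

    Hypothetical helper for illustration only: `indices` and `dictionary`
    are plain Python lists standing in for Arrow arrays.
    """
    seen = set()
    unique_indices = []
    for idx in indices:
        # Keep each index the first time it appears, preserving order.
        if idx not in seen:
            seen.add(idx)
            unique_indices.append(idx)
    # The dictionary is not touched: only the small integer indices are hashed.
    return unique_indices, dictionary


dictionary = ["a", "b", "c", "d"]  # assumed sample values
unique_indices, out_dictionary = unique_dictionary_encoded(
    [3, 0, 1, 3, 0, 1], dictionary
)
print(unique_indices)  # [3, 0, 1]
```

The point of the sketch is that the values themselves never need to be hashed in this case, only the indices.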

If the dictionary varies between chunks, things are more complex: either you
have to unify the dictionaries (and remap each chunk's indices to the unified
dictionary) before hashing, or convert the input to a non-dictionary type and
then hash that.
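
The two options for varying dictionaries can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the Arrow implementation: each chunk is modelled as a hypothetical {{(indices, dictionary)}} pair of plain lists, all function names are made up for the example, and nulls are ignored.

```python
def unify_chunks(chunks):
    """Build one dictionary covering all chunks and remap every chunk's indices.

    `chunks` is a list of (indices, dictionary) pairs (an assumed layout).
    """
    unified = []      # the unified dictionary, in first-seen order
    position = {}     # value -> index in the unified dictionary
    remapped = []
    for indices, dictionary in chunks:
        # transpose[i] is the unified index of this chunk's dictionary entry i.
        transpose = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            transpose.append(position[value])
        remapped.append([transpose[i] for i in indices])
    return unified, remapped


def unique_via_unification(chunks):
    """Option 1: unify the dictionaries, then hash the remapped indices."""
    unified, remapped = unify_chunks(chunks)
    seen, unique_indices = set(), []
    for indices in remapped:
        for idx in indices:
            if idx not in seen:
                seen.add(idx)
                unique_indices.append(idx)
    return unique_indices, unified


def unique_via_decoding(chunks):
    """Option 2: decode to plain values, then hash the values themselves."""
    seen, unique_values = set(), []
    for indices, dictionary in chunks:
        for idx in indices:
            value = dictionary[idx]
            if value not in seen:
                seen.add(value)
                unique_values.append(value)
    return unique_values


# Two chunks whose dictionaries differ (assumed sample data).
chunks = [([1, 0, 1], ["a", "b"]), ([0, 1, 1], ["b", "c"])]
unique_indices, unified = unique_via_unification(chunks)
decoded = unique_via_decoding(chunks)
```

Both routes agree on the distinct values. The first keeps the result dictionary-encoded at the cost of building the unified dictionary, while the second gives up the encoding and hashes the values directly.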

Either way, I personally won't be able to do this for 1.0.0.

> [C++] Support unique kernel for dictionary type
> -----------------------------------------------
>
>                 Key: ARROW-9132
>                 URL: https://issues.apache.org/jira/browse/ARROW-9132
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 0.17.1
>            Reporter: Dave Hirschfeld
>            Assignee: Wes McKinney
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Enabling 
> [`strings_as_dictionary`](https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html?highlight=strings_as_dictionary#obtaining-apache-arrow-result-sets)
>  in `turbodbc` returns a `ChunkedArray` of `dictionary` type (IIUC).
> I'd like to enable this for better performance; however, it seems that not 
> all functionality is implemented for `dictionary` types. In particular, 
> `unique` does not seem to be implemented:
> {code}
> In [40]: nmi.__class__.mro()
> Out[40]: [pyarrow.lib.ChunkedArray, pyarrow.lib._PandasConvertible, object]
> In [41]: nmi.type
> Out[41]: DictionaryType(dictionary<values=string, indices=int32, ordered=0>)
> In [42]: nmi.unique()
> Traceback (most recent call last):
>   File "<ipython-input-42-0fcb7893d5c4>", line 1, in <module>
>     nmi.unique()
>   File "pyarrow\table.pxi", line 307, in pyarrow.lib.ChunkedArray.unique
>   File "pyarrow\error.pxi", line 106, in pyarrow.lib.check_status
> ArrowNotImplementedError: unique not implemented for 
> dictionary<values=string, indices=int32, ordered=0>
> {code}
> It would be very useful if the `dictionary` type supported all the usual 
> operations.



