[ https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234464#comment-16234464 ]
ASF GitHub Bot commented on ARROW-1559:
---------------------------------------
xhochy commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-341184022
At the moment, I am also inclined to take a step back and first revisit this
in a design document. There are several issues on which I have no clear
opinion yet and that will probably require some thought:
* Do we need kernel call methods for each level of
Array/ChunkedArray/Column? Having them, instead of a single generic
`InvokeUnary` that works on each of the three data structures, might lead to a
lot of code duplication or simple pass-through functions. On the other hand,
having only a generic `InvokeUnary` method would prevent some optimizations in
cases where we pass over several arrays in a column and could do some
operations only once.
* My use case here is selective categorical conversion. My initial
approach was to implement `unique(column)` and then use its result to create a
`DictionaryType` instance that would be fed to all underlying arrays to
perform the categorical conversion. This might not be the best solution, as the
`DictionaryType` instance no longer contains the hash map, so the hash map
would have to be reconstructed.
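
To make the trade-off in the two points above more concrete, here is a minimal sketch of a stateful kernel that keeps its hash map alive across all chunks of a column and reuses it for the conversion. It deliberately does not use the Arrow C++ classes: `Chunk`, `ChunkedColumn`, `UniqueKernel`, and `DictionaryEncodeColumn` are hypothetical stand-ins invented for illustration, and only `int64_t` values are handled. It is not meant as a proposal for the actual API.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for illustration only; NOT the Arrow C++ classes.
using Chunk = std::vector<int64_t>;        // plays the role of an Array
using ChunkedColumn = std::vector<Chunk>;  // plays the role of a ChunkedArray/Column

// A stateful "unique" kernel. The hash map outlives a single call, so it can
// be shared across chunks and reused for the later dictionary encoding.
class UniqueKernel {
 public:
  // Consume one chunk, registering every value not seen before.
  void Append(const Chunk& chunk) {
    for (int64_t value : chunk) {
      if (index_of_.find(value) == index_of_.end()) {
        index_of_.emplace(value, static_cast<int32_t>(dictionary_.size()));
        dictionary_.push_back(value);
      }
    }
  }

  // Encode a chunk as dictionary indices using the retained hash map.
  // Values that were never passed to Append are encoded as -1.
  std::vector<int32_t> Encode(const Chunk& chunk) const {
    std::vector<int32_t> indices;
    indices.reserve(chunk.size());
    for (int64_t value : chunk) {
      auto it = index_of_.find(value);
      indices.push_back(it == index_of_.end() ? -1 : it->second);
    }
    return indices;
  }

  // Distinct values in order of first appearance.
  const std::vector<int64_t>& dictionary() const { return dictionary_; }

 private:
  std::unordered_map<int64_t, int32_t> index_of_;
  std::vector<int64_t> dictionary_;
};

// Column-level entry point: one pass builds the shared state, a second pass
// encodes each chunk without rebuilding the hash map.
std::vector<std::vector<int32_t>> DictionaryEncodeColumn(
    const ChunkedColumn& column, UniqueKernel* kernel) {
  for (const Chunk& chunk : column) {
    kernel->Append(chunk);
  }
  std::vector<std::vector<int32_t>> encoded;
  encoded.reserve(column.size());
  for (const Chunk& chunk : column) {
    encoded.push_back(kernel->Encode(chunk));
  }
  return encoded;
}
```

With a column-level entry point like this, the hash map is built once per column rather than once per array, which is the kind of optimization a purely per-array `InvokeUnary` would make harder to express.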
Also, do we have a general design document for the kernels? We need to
think about state, parallelisation, etc. in general. I might have missed it, but
I think having it integrated into the Arrow documentation would ease entry for
future contributors (and myself).
> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> ------------------------------------------------------------------------------
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>