[ 
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236111#comment-16236111
 ] 

ASF GitHub Bot commented on ARROW-1559:
---------------------------------------

wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-341486675
 
 
   I would be happy to write some design documents, but in exchange I need some 
more help moving along routine development and maintenance of the project. The 
honest truth is that I simply haven't had enough time to really think through 
all of the details of what the kernel APIs need to look like -- for example, we 
need to define an object model to accommodate scalar values (for example, 
adding a scalar to an array, or casting a scalar from one type to another). The 
code for the cast kernels could be a lot cleaner than it is. I have a high 
tolerance for this kind of uncertainty and don't mind doing a lot of 
refactoring as we figure out the right general shape of the kernel-operator 
API. I've also been looking at projects like Dremio and TensorFlow a bit for 
inspiration. 
   
   At a high level, the purpose of the kernels is to be able to perform 
analytics on chunked arrays. Depending on the operator, the kernels may need to 
be stateful (e.g. reductions, hash-table based analytics) or stateless 
(elementwise functions, NumPy ufunc-like math, etc.). Operators will need to be 
able to accommodate dispatch to different variants of kernels (e.g. 
SIMD-enabled, non-SIMD, GPU).
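
   The distinction drawn above between stateless and stateful kernels, and the 
idea of dispatching to kernel variants, can be illustrated with a minimal 
sketch. This is not Arrow's API: the chunked array is modeled as a plain list 
of lists, and every name here (add_one, UniqueKernel, KERNELS, 
apply_elementwise, unique) is hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Chunk = List[int]
ChunkedArray = List[Chunk]  # stand-in for a chunked array: a list of chunks


def add_one(chunk: Chunk) -> Chunk:
    """Stateless elementwise kernel: each chunk is processed independently."""
    return [x + 1 for x in chunk]


class UniqueKernel:
    """Stateful kernel: a hash table carries over from one chunk to the next."""

    def __init__(self) -> None:
        # A dict preserves insertion order, so values come out first-seen.
        self._seen: Dict[int, None] = {}

    def consume(self, chunk: Chunk) -> None:
        for value in chunk:
            self._seen.setdefault(value, None)

    def finish(self) -> Chunk:
        return list(self._seen)


# Hypothetical dispatch table: operator name -> variant name -> kernel.
# A real system might register SIMD or GPU variants next to "generic".
KERNELS: Dict[str, Dict[str, Callable[[Chunk], Chunk]]] = {
    "add_one": {"generic": add_one},
}


def apply_elementwise(
    name: str, data: ChunkedArray, variant: str = "generic"
) -> ChunkedArray:
    """Run the requested variant, falling back to the generic kernel."""
    variants = KERNELS[name]
    kernel = variants.get(variant, variants["generic"])
    return [kernel(chunk) for chunk in data]


def unique(data: ChunkedArray) -> Chunk:
    """Feed every chunk through one stateful kernel, then collect the result."""
    kernel = UniqueKernel()
    for chunk in data:
        kernel.consume(chunk)
    return kernel.finish()
```

   In this sketch the stateless kernel could be run on chunks in any order (or 
in parallel), while the stateful one has to see every chunk before its result 
is complete, which is why the two kinds likely need different interfaces.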
   
   I'd like to spend the majority of my time working solely on kernels and 
analytics, but I can't do that at the expense of all the other small things 
that would fall through the cracks otherwise. Out of the 0.8.0 milestone so 
far, I have resolved 64 JIRAs -- the rest of the Arrow community has done 67. 
So my effective share of the burden of moving the project along, looking only 
at JIRAs, is around 50%. When you add release management and PR maintenance, 
the number goes above 50% for sure. 
   
   This is not a complaint, just pointing out that with things as they are I am 
not sure I can do more than I'm already doing. I will be more than happy to do 
more design and architecture work as soon as the community starts sharing more 
of the development workload. At the moment, letting the development work drop 
in order to write more documentation and design docs seems like an unacceptable 
compromise to me. Getting the project to a format-stable 1.0.0 release and 
making it suitable for production use for data interchange in Apache Spark and 
elsewhere is the most important thing for me right now, and engineering work in 
support of that is going to take priority over design docs.
   
   I'm on a plane right now so I'm going to hack on these hash kernels for 
several hours and see how far I can get.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-1559
>                 URL: https://issues.apache.org/jira/browse/ARROW-1559
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Uwe L. Korn
>            Priority: Major
>              Labels: Analytics, pull-request-available
>             Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
