[
https://issues.apache.org/jira/browse/ARROW-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236111#comment-16236111
]
ASF GitHub Bot commented on ARROW-1559:
---------------------------------------
wesm commented on issue #1266: WIP: ARROW-1559: Add unique kernel
URL: https://github.com/apache/arrow/pull/1266#issuecomment-341486675
I would be happy to write some design documents, but in exchange I need some
more help moving along routine development and maintenance of the project. The
honest truth is that I simply haven't had enough time to really think through
all of the details of what the kernel APIs need to look like -- for example, we
need to define an object model to accommodate scalar values (for example,
adding a scalar to an array, or casting a scalar from one type to another). The
code for the cast kernels could be a lot cleaner than it is. I have a high
tolerance for this kind of uncertainty and don't mind doing a lot of
refactoring as we figure out the right general shape of the kernel-operator
API. I've also been looking at libraries like Dremio and TensorFlow a bit for
inspiration.
At a high level, the purpose of the kernels is to be able to perform
analytics on chunked arrays. Depending on the operator, the kernels may need to
be stateful (e.g. reductions, hash-table based analytics) or stateless
(elementwise functions, NumPy ufunc-like math, etc.). Operators will need to be
able to accommodate dispatch to different variants of kernels (e.g.
SIMD-enabled, non-SIMD, GPU).
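To make the stateless/stateful distinction concrete, here is a minimal sketch. It is not Arrow's kernel API: `Chunk`, `ChunkedArray`, `AddScalar`, `UniqueKernel`, and the two driver functions are hypothetical names invented for illustration, and plain `std::vector` stands in for Arrow arrays. The stateless kernel can treat every chunk independently, while the "unique" kernel has to carry its hash table from one chunk to the next.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins, not Arrow types: a chunk is a plain vector and a
// chunked array is a vector of chunks.
using Chunk = std::vector<int64_t>;
using ChunkedArray = std::vector<Chunk>;

// Stateless kernel: each output chunk depends only on its own input chunk
// (here: add a scalar to every element).
Chunk AddScalar(const Chunk& input, int64_t scalar) {
  Chunk out;
  out.reserve(input.size());
  for (int64_t value : input) out.push_back(value + scalar);
  return out;
}

// Stateful kernel: the hash table persists across chunks so that a value seen
// in an earlier chunk is not reported again in a later one.
class UniqueKernel {
 public:
  // Feed one chunk; newly seen values are recorded in first-seen order.
  void Consume(const Chunk& input) {
    for (int64_t value : input) {
      if (seen_.insert(value).second) uniques_.push_back(value);
    }
  }

  // Distinct values accumulated over all chunks consumed so far.
  const Chunk& Finish() const { return uniques_; }

 private:
  std::unordered_set<int64_t> seen_;
  Chunk uniques_;
};

// Apply the stateless kernel chunk by chunk; the chunk layout is preserved.
ChunkedArray AddScalarChunked(const ChunkedArray& input, int64_t scalar) {
  ChunkedArray out;
  out.reserve(input.size());
  for (const Chunk& chunk : input) out.push_back(AddScalar(chunk, scalar));
  return out;
}

// Run one stateful kernel instance over every chunk, then collect the result.
Chunk UniqueChunked(const ChunkedArray& input) {
  UniqueKernel kernel;
  for (const Chunk& chunk : input) kernel.Consume(chunk);
  return kernel.Finish();
}
```

In this sketch the stateless path could process chunks in any order or in parallel, whereas the stateful path needs a single kernel instance (or a later merge step) to see all chunks. Dispatch between SIMD, non-SIMD, and GPU variants is left out.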
I'd like to spend the majority of my time working solely on kernels and
analytics, but I can't do that at the expense of all the other small things
that would fall through the cracks otherwise. Out of the 0.8.0 milestone so
far, I have resolved 64 JIRAs -- the rest of the Arrow community has done 67.
So, looking only at JIRAs, my share of the work of moving the project along is
around 50% (64 of 131). When you add release management and PR maintenance, the
number goes above 50% for sure.
This is not a complaint; I am just pointing out that, with things as they are,
I am not sure I can do more than I'm already doing. I will be more than happy
to do more design and architecture work as soon as the community starts sharing
more of the development workload. At the moment, letting the development work
drop in order to write more documentation and design docs seems like an
unacceptable compromise to me. Getting the project to a format-stable 1.0.0
release and making it suitable for production use for data interchange in
Apache Spark and elsewhere is the most important thing for me right now, and
engineering work in support of that is going to take priority over design docs.
I'm on a plane right now so I'm going to hack on these hash kernels for
several hours and see how far I can get.
> [C++] Kernel implementations for "unique" (compute distinct elements of array)
> ------------------------------------------------------------------------------
>
> Key: ARROW-1559
> URL: https://issues.apache.org/jira/browse/ARROW-1559
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: Analytics, pull-request-available
> Fix For: 0.8.0
>
>