[
https://issues.apache.org/jira/browse/ARROW-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ian Cook updated ARROW-12728:
-----------------------------
Summary: [C++][Compute] Implement count_distinct/distinct hash aggregate
kernels (was: [C++][Compute] Aggregates: implement count distinct)
> [C++][Compute] Implement count_distinct/distinct hash aggregate kernels
> ------------------------------------------------------------------------
>
> Key: ARROW-12728
> URL: https://issues.apache.org/jira/browse/ARROW-12728
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 4.0.0
> Reporter: Michal Nowakiewicz
> Assignee: David Li
> Priority: Major
> Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> Implement a count distinct aggregate that internally reuses the hash table
> from hash group by.
> This adds support for SQL queries such as:
> select a, count(distinct b), count(distinct c) from t group by a
> For instance, to compute count(distinct b), the first group id mapping
> assigns a group id based on the value of column a; a second group id mapping
> is then done inside the count(distinct b) aggregate using the key
> (groupid(a), b) (and similarly for count(distinct c)).
> After all input rows are consumed, the final processing step scans the hash
> table keyed on (groupid(a), b) and updates an array of counts indexed by
> groupid(a).
> The resulting array of counts is the output of the count distinct
> aggregate.
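
The two-level group id mapping described above can be sketched in a few lines. This is a plain Python illustration under stated assumptions, not the Arrow C++ kernel or its API: the function name grouped_count_distinct is hypothetical, ordinary dicts stand in for the hash tables, and None is treated as an ordinary value rather than being skipped the way SQL count(distinct ...) skips NULL.

```python
def grouped_count_distinct(keys, values):
    """Count distinct `values` per group of `keys` (illustrative sketch only)."""
    # First mapping: value of the group-by column -> dense group id.
    group_ids = {}
    # Second mapping, owned by the aggregate: (group id, value) -> dense pair id.
    # A dict stands in for the hash table reused from hash group by.
    seen_pairs = {}

    for key, value in zip(keys, values):
        gid = group_ids.setdefault(key, len(group_ids))
        seen_pairs.setdefault((gid, value), len(seen_pairs))

    # Final step: scan the second table once and add one to the count of the
    # owning group for every distinct (group id, value) pair.
    counts = [0] * len(group_ids)
    for gid, _value in seen_pairs:
        counts[gid] += 1

    # Dicts preserve insertion order, so the keys come out in group id order.
    return list(group_ids), counts


# select a, count(distinct b) from t group by a
a = ["x", "y", "x", "x", "y"]
b = [1, 1, 2, 1, 1]
group_keys, counts = grouped_count_distinct(a, b)
print(group_keys, counts)  # ['x', 'y'] [2, 1]
```

For a query with several count(distinct ...) columns, each aggregate would keep its own second mapping while sharing the first one.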
--
This message was sent by Atlassian Jira
(v8.3.4#803005)