[ https://issues.apache.org/jira/browse/IMPALA-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17094335#comment-17094335 ]
Gabor Kaszab commented on IMPALA-9593:
--------------------------------------
[~jbapple] If we only consider running a count(distinct) approximation over a
specific set of data, then ndv() would be enough.
What the DataSketches library brings in addition:
1) It can produce a sketch from a set of input data (e.g. a column of a table)
that we can write to a table or view. As a second step, we can use the
pre-calculated sketch to produce the count(distinct) estimate without scanning
any data other than the sketch.
2) Hive is also planning to add support for DataSketches algorithms, so it
will be feasible to, e.g., calculate a sketch with Hive and use it to produce
the estimate with Impala, and vice versa.
3) These sketches can be merged with each other, so we can for instance
pre-calculate a sketch for each partition in a table and merge the ones that
are relevant for the particular query, again without scanning the table itself.
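
To illustrate points 1) and 3), here is a minimal sketch of the idea in Python.
It is a toy HyperLogLog written for this comment, not the DataSketches
implementation or its API; the class name ToyHll, the helper build_sketch, the
register count and the example partitions are all assumptions made up for the
illustration. It builds one sketch per partition, merges them, and estimates
the distinct count from the merged sketch alone.

```python
import hashlib
import math


class ToyHll:
    """Toy HyperLogLog for illustration only (not the DataSketches code)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used as register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def update(self, value):
        # Derive a 64-bit hash from the string form of the value.
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging is a register-wise max, so no input rows are needed.
        assert self.p == other.p
        merged = ToyHll(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def build_sketch(values):
    sketch = ToyHll()
    for v in values:
        sketch.update(v)
    return sketch


# Hypothetical partitions with overlapping values: 50000 distinct in total.
partitions = {
    "p1": range(0, 20000),
    "p2": range(15000, 35000),
    "p3": range(30000, 50000),
}

# Step 1: pre-calculate one sketch per partition (this is what would be stored).
stored = {name: build_sketch(rows) for name, rows in partitions.items()}

# Step 2: answer a query by merging the relevant sketches, without the rows.
merged = stored["p1"].merge(stored["p2"]).merge(stored["p3"])

# For comparison only: a sketch built by scanning all rows in one pass.
full = build_sketch(v for rows in partitions.values() for v in rows)

print("estimate from merged sketches:", round(merged.estimate()))
print("estimate from full scan:      ", round(full.estimate()))
```

In this toy version the merged sketch has exactly the same registers as the one
built from a full scan, which is why the estimate can be produced from the
stored per-partition sketches. Summing the per-partition estimates instead
would double count the values that appear in more than one partition.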
> Implement count(distinct) function (DataSketches/HLL)
> -----------------------------------------------------
>
> Key: IMPALA-9593
> URL: https://issues.apache.org/jira/browse/IMPALA-9593
> Project: IMPALA
> Issue Type: Epic
> Components: Backend
> Reporter: Boglarka Egyed
> Assignee: Gabor Kaszab
> Priority: Major
>
> Implement the count(distinct) function from the DataSketches library for HLL
> in C++.
> General info about the sketch:
> http://datasketches.apache.org/docs/HLL/HLL.html
> C++ implementation to wrap:
> https://github.com/apache/incubator-datasketches-cpp/tree/master/hll