GitHub user larvaboy opened a pull request:
https://github.com/apache/spark/pull/737
Implement ApproximateCountDistinct for SparkSql
Add an implementation of ApproximateCountDistinct to SparkSql. We use the
HyperLogLog algorithm implemented in stream-lib, and do the count in two
phases: 1) counting the number of distinct elements in each partition, and 2)
merging the HyperLogLog results from the different partitions.
A simple serializer and test cases are added as well.
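The two-phase count described above can be illustrated with a minimal,
self-contained sketch. This is not the code in this pull request and not
stream-lib's HyperLogLog: the class `HyperLogLog`, the helper
`approx_count_distinct`, the SHA-1 based hash, and the precision `p=12` are
all assumptions chosen for illustration. It only shows why per-partition
sketches can be merged (register-wise maximum) to estimate the overall
distinct count.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog sketch (illustrative, not stream-lib's)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits (assumed default)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def offer(self, value):
        # Hash the value to 64 bits; SHA-1 is an arbitrary choice here.
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging two sketches keeps the maximum of each register pair.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return round(m * math.log(m / zeros))
        return round(raw)


def approx_count_distinct(partitions, p=12):
    # Phase 1: build one sketch per partition.
    sketches = []
    for part in partitions:
        hll = HyperLogLog(p)
        for value in part:
            hll.offer(value)
        sketches.append(hll)
    # Phase 2: merge the per-partition sketches and read off the estimate.
    total = HyperLogLog(p)
    for sketch in sketches:
        total = total.merge(sketch)
    return total.cardinality()
```

Because merging is a register-wise maximum, elements that occur in several
partitions are not double-counted, and only the fixed-size sketches need to
be serialized and moved between partitions.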
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/larvaboy/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/737.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #737
----
commit 871abec814fa15e9507a98ca1b4718429781efd7
Author: larvaboy <[email protected]>
Date: 2014-05-10T23:20:10Z
Fix a couple of minor typos.
commit f73651c8dc23fdd83b1bfb35bda135449f84c5c5
Author: larvaboy <[email protected]>
Date: 2014-05-11T23:15:35Z
Fix a minor typo in the toString method of the Count case class.
commit 25b46046c5e7a772dd25f2bd7ae711c9dabd3959
Author: larvaboy <[email protected]>
Date: 2014-05-12T09:25:59Z
Add SparkSql serializer for HyperLogLog.
commit 80f1da4a48d3929272a4436aee26531f03eab4aa
Author: larvaboy <[email protected]>
Date: 2014-05-12T09:38:16Z
Add ApproximateCountDistinct aggregates and functions.
We use stream-lib's HyperLogLog to approximately count the number of
distinct elements in each partition, and merge the HyperLogLogs to
compute the final result.
If the expressions cannot be successfully broken apart, we fall back to
the exact CountDistinct.
commit 234a270a5e6766ad41b4fb49a54d42ddb4643264
Author: larvaboy <[email protected]>
Date: 2014-05-12T04:58:54Z
Add the parser for the approximate count.
commit cf73b921cfa901ffb40c848ca1961378475fea1a
Author: larvaboy <[email protected]>
Date: 2014-05-12T05:05:15Z
Add a test case for count distinct and approximate count distinct.
----