GitHub user larvaboy opened a pull request:

    https://github.com/apache/spark/pull/737

    Implement ApproximateCountDistinct for SparkSql

    Add the implementation for ApproximateCountDistinct to SparkSql. We use the
    HyperLogLog algorithm implemented in stream-lib, and do the count in two
    phases: 1) approximately counting the distinct elements in each partition,
    and 2) merging the HyperLogLog results from the different partitions.
    
    A simple serializer and test cases are added as well.
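
The two phases described above can be sketched roughly as follows. This is a toy, self-contained illustration and not the code in this pull request: the patch uses stream-lib's HyperLogLog from Scala, whereas `TinyHLL`, `approx_count_distinct`, the SHA-1 hashing, and the precision `p=12` are all hypothetical names and choices made only for this example.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog sketch for illustration (not stream-lib's implementation)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits (assumed value)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def offer(self, value):
        # Hash the value to 64 bits; SHA-1 is an arbitrary choice for this sketch.
        digest = hashlib.sha1(repr(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                      # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging two sketches is a register-wise maximum.
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)         # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


def approx_count_distinct(partitions, p=12):
    # Phase 1: build one sketch per partition.
    partials = []
    for part in partitions:
        hll = TinyHLL(p)
        for v in part:
            hll.offer(v)
        partials.append(hll)
    # Phase 2: merge the partial sketches and read off the estimate.
    merged = TinyHLL(p)
    for sketch in partials:
        merged = merged.merge(sketch)
    return merged.cardinality()
```

Because the merge is a register-wise maximum, merging the per-partition sketches yields the same registers as feeding every element into a single sketch, which is why the count can be split across partitions. Only the fixed-size register arrays need to be shipped between phases, not the elements themselves.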

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/larvaboy/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/737.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #737
    
----
commit 871abec814fa15e9507a98ca1b4718429781efd7
Author: larvaboy <[email protected]>
Date:   2014-05-10T23:20:10Z

    Fix a couple of minor typos.

commit f73651c8dc23fdd83b1bfb35bda135449f84c5c5
Author: larvaboy <[email protected]>
Date:   2014-05-11T23:15:35Z

    Fix a minor typo in the toString method of the Count case class.

commit 25b46046c5e7a772dd25f2bd7ae711c9dabd3959
Author: larvaboy <[email protected]>
Date:   2014-05-12T09:25:59Z

    Add SparkSql serializer for HyperLogLog.

commit 80f1da4a48d3929272a4436aee26531f03eab4aa
Author: larvaboy <[email protected]>
Date:   2014-05-12T09:38:16Z

    Add ApproximateCountDistinct aggregates and functions.
    
    We use stream-lib's HyperLogLog to approximately count the number of
    distinct elements in each partition, and merge the HyperLogLogs to
    compute the final result.
    
    If the expressions cannot be successfully broken apart, we fall back to
    the exact CountDistinct.

commit 234a270a5e6766ad41b4fb49a54d42ddb4643264
Author: larvaboy <[email protected]>
Date:   2014-05-12T04:58:54Z

    Add the parser for the approximate count.

commit cf73b921cfa901ffb40c848ca1961378475fea1a
Author: larvaboy <[email protected]>
Date:   2014-05-12T05:05:15Z

    Add a test case for count distinct and approximate count distinct.

----


