[
https://issues.apache.org/jira/browse/DATAFU-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537745#comment-16537745
]
Matthew Hayes commented on DATAFU-100:
--------------------------------------
I ran some tests comparing HyperLogLogPlus to using DISTINCT. Each test is
listed below. The metric for each is (number of maps times avg map time) plus
(number of reduces times avg reduce time). This captures the total amount of
work done. In each result the first number is HLLP and the second is DISTINCT.
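As a rough sketch of how that metric could be computed, reading it as task
count times average task time for each phase, summed over the map and reduce
phases. The function name, parameter names, and the sample inputs are
hypothetical; the comment does not give the per-phase counts or times, nor the
time unit.

```python
def total_work(num_maps, avg_map_time, num_reduces, avg_reduce_time):
    """Total task time across a job: maps plus reduces.

    Assumes both averages are in the same (unspecified) time unit.
    """
    map_work = num_maps * avg_map_time
    reduce_work = num_reduces * avg_reduce_time
    return map_work + reduce_work


# Hypothetical inputs for illustration only, not taken from the tests.
example = total_work(num_maps=10, avg_map_time=100.0,
                     num_reduces=2, avg_reduce_time=50.0)
print(example)
```

A lower value means less total work, so the two approaches can be compared
directly even when they use different numbers of tasks.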
1) 1 billion A-Z letters over 10 files (14589 vs. 5029)
2) 1 billion values between 0 and 1 million, over 10 files (14943 vs. 11684)
3) 250 million values (large keys) between 0 and 1 million, over 5 files (6032
vs. 6214)
So generally I find that the UDF is either slower than DISTINCT or only
marginally better. Given this, I think it's better to deprecate the UDF. Even
for #3 the improvement doesn't seem significant enough to be worth giving up
the exact count.
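To put the three results on a common scale, a small sketch that divides the
HLLP figure by the DISTINCT figure for each test. The numbers are the ones
reported above; the labels and the helper name are only for illustration.

```python
# (HLLP, DISTINCT) total-work figures as reported for each test.
results = {
    "1) A-Z letters, 10 files": (14589, 5029),
    "2) values 0 to 1 million, 10 files": (14943, 11684),
    "3) large keys, 0 to 1 million, 5 files": (6032, 6214),
}


def relative_cost(hllp, distinct):
    """Ratio of HLLP work to DISTINCT work; below 1.0 means HLLP did less."""
    return hllp / distinct


for name, (hllp, distinct) in results.items():
    print(f"{name}: HLLP/DISTINCT = {relative_cost(hllp, distinct):.2f}")
```

The ratios come out to roughly 2.90, 1.28 and 0.97, so HLLP did less work only
in test 3, and there by about 3%.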
> Document recommendations on using HyperLogLogPlus
> -------------------------------------------------
>
> Key: DATAFU-100
> URL: https://issues.apache.org/jira/browse/DATAFU-100
> Project: DataFu
> Issue Type: Improvement
> Reporter: Matthew Hayes
> Priority: Minor
>
> We should provide recommendations about how to use HyperLogLogPlus effectively.
> For example 1) how should the precision value be used, 2) when would a count
> distinct be better, etc.