[ https://issues.apache.org/jira/browse/DATAFU-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539080#comment-16539080 ]
Matthew Hayes commented on DATAFU-100:
--------------------------------------

Here are the max map and reduce times. The first pair is HLLP and the second pair is DISTINCT. By this metric HLLP is also worse:

1) 1 billion A-Z letters over 10 files: (1072, 519) vs (356, 80)
2) 1 billion values between 0 and 1 million, over 10 files: (385, 799) vs (253, 487)
3) 250 million values (large keys) between 0 and 1 million, over 5 files: (62, 297) vs (50, 318)

Regarding the number of files, I think more files could only contribute to worse performance in this particular case because of the added per-file overhead.

> Document recommendations on using HyperLogLogPlus
> -------------------------------------------------
>
>          Key: DATAFU-100
>          URL: https://issues.apache.org/jira/browse/DATAFU-100
>      Project: DataFu
>   Issue Type: Improvement
>     Reporter: Matthew Hayes
>     Priority: Minor
>  Attachments: DATAFU-100.patch
>
> We should provide recommendations about how to use HyperLogLogPlus effectively. For example: 1) how should the precision value be used, 2) when would a count distinct be better, etc.
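For reference on what such recommendations would cover: with m = 2^p registers, HyperLogLog's typical relative error is about 1.04/sqrt(m), so p = 20 gives roughly 1.04/1024, i.e. about 0.1%. The two pipelines benchmarked above look roughly like the sketch below; the input path, schema, and the precision constructor argument are assumptions for illustration, not taken from the actual benchmark jobs.

{code}
-- Sketch of the two approaches compared above (assumed input and schema).
-- Assumption: HyperLogLogPlusPlus accepts the precision p as a constructor argument.
DEFINE HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus('20');

data    = LOAD 'input' AS (val:chararray);
grouped = GROUP data ALL;

-- Approximate distinct count via HLL++ (the "HLLP" numbers above).
approx = FOREACH grouped GENERATE HyperLogLogPlusPlus(data) AS approx_distinct;

-- Exact distinct count via a nested DISTINCT (the "DISTINCT" numbers above).
exact = FOREACH grouped {
    d = DISTINCT data.val;
    GENERATE COUNT(d) AS exact_distinct;
};
{code}

The trade-off the issue asks to document: DISTINCT gives an exact answer, while HLLP gives a fixed-memory estimate whose error shrinks as p grows, which generally pays off only when the distinct set is too large to hold in memory or must be merged across many groups.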