[
https://issues.apache.org/jira/browse/MAHOUT-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069863#comment-13069863
]
Ted Dunning commented on MAHOUT-753:
------------------------------------
Murmur hash should also be good if fed from a sequence counter multiplied by a
large prime. This is a weak form of congruential generator.
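As a rough sketch of that idea (not the attached MurmurHashRandom.java): the class name, the choice of prime, and the use of the MurmurHash3 64-bit finalizer as a stand-in for the full Murmur hash are all assumptions made for illustration.

```java
// Hypothetical sketch: a counter times a large prime, fed through a
// Murmur-style mixing step. Not the code attached to this issue.
final class CounterHashSketch {
  // 2^61 - 1 is a Mersenne prime; picked here only as an example of a
  // "large prime". It is odd, so counter * PRIME (mod 2^64) never collides.
  static final long PRIME = (1L << 61) - 1;

  // The 64-bit finalizer ("fmix64") from MurmurHash3, used as a stand-in
  // for hashing the 8 bytes of the input with the full Murmur hash.
  static long fmix64(long h) {
    h ^= h >>> 33;
    h *= 0xff51afd7ed558ccdL;
    h ^= h >>> 33;
    h *= 0xc4ceb9fe1a85ec53L;
    h ^= h >>> 33;
    return h;
  }

  // The n-th value of the sequence. Note that counter 0 maps to 0,
  // so a caller would probably start counting at 1.
  static long valueAt(long counter) {
    return fmix64(counter * PRIME);
  }

  // The same value mapped to a double in [0, 1) using the top 53 bits.
  static double uniformAt(long counter) {
    return (valueAt(counter) >>> 11) * 0x1.0p-53;
  }
}
```

Because the value depends only on the counter, any position in the sequence can be recomputed directly, without stepping through the earlier ones.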
Lance, the best test for distribution would be a one-sided KS test. I don't
think that we have one handy, but it is very easy to build one from spare
parts. See http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
The basic idea is that you find the largest positive and negative differences
of the empirical cumulative distribution function from the theoretically
desired cumulative distribution function. The size of these errors is a good
measure of how different the distributions are. At one million samples, this
difference should be less than about 0.002. Commons Math has a way to compute
the distribution of the test statistic if we really care about the details.
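A minimal sketch of the "spare parts" version, assuming the generator's output has already been mapped to doubles in [0, 1) so that the desired CDF is simply F(x) = x. The class and method names are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper: KS statistics of a sample against Uniform[0, 1).
final class KsUniformCheck {
  // Returns {dPlus, dMinus}: the largest amount by which the empirical CDF
  // rises above the uniform CDF, and the largest amount it falls below it.
  static double[] ksStatistics(double[] samples) {
    double[] x = samples.clone();
    Arrays.sort(x);
    int n = x.length;
    double dPlus = 0.0;
    double dMinus = 0.0;
    for (int i = 0; i < n; i++) {
      double cdf = x[i]; // uniform CDF: F(x) = x
      // Just after x[i] the empirical CDF is (i + 1) / n; just before, i / n.
      dPlus = Math.max(dPlus, (i + 1) / (double) n - cdf);
      dMinus = Math.max(dMinus, cdf - i / (double) n);
    }
    return new double[] {dPlus, dMinus};
  }
}
```

To use it, fill an array with one million values from the generator under test and check that both returned numbers stay below roughly 0.002. For reference, the usual asymptotic 5% critical value for the two-sided statistic is about 1.36 / sqrt(n), which is about 0.00136 at n = 1,000,000.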
> MurmurHashRandom class: subclass of java.util.Random that uses MurmurHash
> -------------------------------------------------------------------------
>
> Key: MAHOUT-753
> URL: https://issues.apache.org/jira/browse/MAHOUT-753
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Reporter: Lance Norskog
> Assignee: Sean Owen
> Priority: Minor
> Attachments: MurmurBench.java, MurmurHashRandom.java
>
>