[ 
https://issues.apache.org/jira/browse/MAHOUT-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537058#comment-13537058
 ] 

Ted Dunning commented on MAHOUT-1130:
-------------------------------------

A quick look indicates that, yes, this function is a public static.

Andrey,  

I am completely swamped.  The test needed here would be very simple.  It
would need a sequence file with test data, which could be created at
runtime from synthetic data or kept in the test resources directory.  The
test would then run through the data several times, each time picking a
small number of seeds from the file and recording whether the last vector
in the input appears in the sample.  If the input is large-ish (say 10,000
vectors) and the sample small (10 or so), then the last vector should
appear in the sample only rarely (about 1 time in 1,000).  With the bug as
you assert it, the last vector would be in the sample with very high
probability.

If you add up the number of times the last few vectors appear in the
sample, you should either get a small number (correct) or a big number
(incorrect).  This should be a very sensitive test.
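
A minimal sketch of such a check, under stated assumptions: it does not
use a sequence file or any Mahout class, and plain integers 0..n-1 stand in
for the input vectors.  The class and method names (SeedSamplingCheck,
sampleAsReported, sampleReservoir, countLastFew) are hypothetical.
sampleAsReported only mimics the condition quoted in the issue
(random.nextInt(currentSize + 1) != 0); the surrounding replacement logic is
assumed, not copied from RandomSeedGenerator.  sampleReservoir is standard
reservoir sampling, shown for comparison and not necessarily the fix the
project would adopt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SeedSamplingCheck {

  /** Rule as quoted in the issue: once k items are held, replace unless nextInt(k + 1) == 0. */
  static List<Integer> sampleAsReported(int n, int k, Random random) {
    List<Integer> chosen = new ArrayList<>(k);
    for (int i = 0; i < n; i++) {
      int currentSize = chosen.size();
      if (currentSize < k) {
        chosen.add(i);
      } else if (random.nextInt(currentSize + 1) != 0) {
        // Assumed eviction step: overwrite a uniformly chosen slot.
        chosen.set(random.nextInt(currentSize), i);
      }
    }
    return chosen;
  }

  /** Standard reservoir sampling: item i (0-based) is accepted with probability k / (i + 1). */
  static List<Integer> sampleReservoir(int n, int k, Random random) {
    List<Integer> chosen = new ArrayList<>(k);
    for (int i = 0; i < n; i++) {
      if (i < k) {
        chosen.add(i);
      } else {
        int slot = random.nextInt(i + 1);
        if (slot < k) {
          chosen.set(slot, i);
        }
      }
    }
    return chosen;
  }

  /** Adds up, over all trials, how often one of the last few inputs ends up in the sample. */
  static int countLastFew(boolean asReported, int n, int k, int lastFew, int trials, long seed) {
    Random random = new Random(seed);
    int hits = 0;
    for (int t = 0; t < trials; t++) {
      List<Integer> sample =
          asReported ? sampleAsReported(n, k, random) : sampleReservoir(n, k, random);
      for (int item : sample) {
        if (item >= n - lastFew) {
          hits++;
        }
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    int n = 10_000;   // input size suggested in the comment
    int k = 10;       // sample size suggested in the comment
    int lastFew = 5;  // how many trailing inputs to watch
    int trials = 200;
    System.out.println("rule as reported: " + countLastFew(true, n, k, lastFew, trials, 42L));
    System.out.println("reservoir rule:   " + countLastFew(false, n, k, lastFew, trials, 42L));
  }
}
```

With the rule as reported, each new input replaces a sampled element with
probability 10/11 once the sample is full, so the count for the last few
inputs should come out in the hundreds over 200 trials.  With reservoir
sampling each input is retained with probability about 10/10,000, so the
count should stay near zero.  A real test would read the vectors from a
sequence file and call the seed generator itself.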

I don't have time to do this.  Do you?  It would really help the project
to get this right.
                
> Wrong logic in org.apache.mahout.clustering.kmeans.RandomSeedGenerator
> ----------------------------------------------------------------------
>
>                 Key: MAHOUT-1130
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1130
>             Project: Mahout
>          Issue Type: Bug
>         Environment: mahout 0.7 from maven central
>            Reporter: Andrey Davydov
>
> There is following code in line 101:
>               } else if (random.nextInt(currentSize + 1) != 0) { // with chance 1/(currentSize+1) pick new element
> but it actually means pick new element with chance currentSize/(currentSize+1)
> so generator takes initial centers from the end of source data file.
> It seems that chance of replace vector in output set should decrease with 
> number of processed input vectors

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
