[ 
https://issues.apache.org/jira/browse/MAHOUT-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978451#comment-13978451
 ] 

Maxim Arap commented on MAHOUT-1468:
------------------------------------

Andrew: The default initial value for numClusters is 20, which seems arbitrary.
As the algorithm runs, numClusters grows to roughly k log(n), where k is the
final number of clusters (the number the BallKMeans step outputs) and n is the
size of the dataset. In practice, k log(n) can be much larger than 20, depending
on the dataset.
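
To give a feel for the gap, here is a rough sketch of the k log(n) estimate. It assumes the natural logarithm and ignores any constant factor; the function name and the sample values of k and n are hypothetical and are not taken from Mahout.

```python
import math


def estimated_num_clusters(k, n):
    """Rough k * log(n) estimate of how large numClusters may grow.

    Assumption: natural logarithm, no constant factor. This is an
    illustration only, not Mahout's actual computation.
    """
    if k <= 0 or n <= 1:
        raise ValueError("k must be positive and n must be greater than 1")
    return math.ceil(k * math.log(n))


# Hypothetical values: k = 20 final clusters, n = 21578 data points.
estimate = estimated_num_clusters(20, 21578)
print("default numClusters: 20, k log(n) estimate:", estimate)
```

Even for a modest dataset, the estimate comes out well above the default of 20, which is why a value derived from k and n may be a better starting point.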

Suneel: I tried running the algorithm in both sequential mode and
mapreduce mode on the Reuters data last night, but both gave me runtime errors. The
reason may be that my laptop has hadoop-2.2.0, which may not be compatible with
mahout at this point.

> Creating a new page for StreamingKMeans documentation on mahout website
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-1468
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1468
>             Project: Mahout
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.0
>            Reporter: Pavan Kumar N
>            Assignee: Andrew Musselman
>              Labels: Documentation
>             Fix For: 1.0
>
>         Attachments: StreamingKMeans.txt
>
>
> A separate page is required for the Streaming K Means algorithm, with a
> description and overview, an explanation of the various parameters that can
> be used in streamingkmeans, the strategy for parallelization, and a link to
> this paper:
> http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf



--
This message was sent by Atlassian JIRA
(v6.2#6252)
