[ 
https://issues.apache.org/jira/browse/MAHOUT-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978451#comment-13978451
 ] 

Maxim Arap edited comment on MAHOUT-1468 at 4/23/14 5:01 PM:
-------------------------------------------------------------

Andrew: The default initial value for numClusters is 20, which seems arbitrary. 
As the algorithm runs, numClusters will grow to roughly k log n, where k is the 
final number of clusters (that the BallKMeans step will output) and n is the 
size of the dataset. In practice, k log n can be much larger than 20, depending 
on the dataset and the final number of clusters k. 
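
To put rough numbers on the k log n estimate, here is a minimal Java sketch. 
It is not Mahout code: the class and method names are hypothetical, the natural 
logarithm is an assumption (the base of the log is not stated above), and the 
values of k and n are purely illustrative.

```java
public class SketchSizeEstimate {
    // Rough estimate of how large numClusters may grow: k * ln(n).
    // Assumption: natural log; the base is not specified in the comment above.
    static long estimatedSketchSize(int k, long n) {
        return Math.round(k * Math.log(n));
    }

    public static void main(String[] args) {
        int k = 10;           // hypothetical final number of clusters
        long n = 1_000_000L;  // hypothetical dataset size
        // prints 138, already well above the default initial value of 20
        System.out.println(estimatedSketchSize(k, n));
    }
}
```

Even with a modest k, the estimate exceeds the default of 20 by a wide margin, 
which is why the initial value matters less than it may seem.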

Suneel: I tried running the algorithm in both sequential mode and mapreduce 
mode on the Reuters data last night, but both gave me runtime errors. The 
reason may be that my laptop has hadoop-2.2.0, which may not be compatible with 
Mahout at this point. I'll try to run it on an earlier version of Hadoop 
tonight. 


> Creating a new page for StreamingKMeans documentation on mahout website
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-1468
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1468
>             Project: Mahout
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.0
>            Reporter: Pavan Kumar N
>            Assignee: Andrew Musselman
>              Labels: Documentation
>             Fix For: 1.0
>
>         Attachments: StreamingKMeans.txt
>
>
> A separate page is required for the Streaming K Means algorithm description 
> and overview, explaining the various parameters that can be used in 
> streamingkmeans and the strategy for parallelization, with a link to this paper: 
> http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf


