[
https://issues.apache.org/jira/browse/MAHOUT-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978451#comment-13978451
]
Maxim Arap commented on MAHOUT-1468:
------------------------------------
Andrew: The default initial value for numClusters is 20, which seems arbitrary.
As the algorithm runs, numClusters will grow to roughly k log(n), where k is
the final number of clusters (that the BallKMeans step will output) and n is the
size of the dataset. In practice k log(n) can be much larger than 20, depending
on the dataset.
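To make the gap concrete, here is a minimal sketch of the k log(n) estimate. The function name and the sample values of k and n are hypothetical, and the natural logarithm is an assumption, since the base is not stated above; this is not Mahout code.

```python
import math


def estimated_num_clusters(k: int, n: int) -> int:
    """Rough k * log(n) estimate of how large numClusters may grow.

    k: final number of clusters the BallKMeans step should output.
    n: number of points in the dataset.
    Assumes the natural logarithm; never returns less than k.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return max(k, math.ceil(k * math.log(n)))


if __name__ == "__main__":
    default_initial = 20  # default initial numClusters mentioned above
    # Hypothetical sizes, for illustration only.
    for k, n in [(10, 10_000), (20, 1_000_000)]:
        estimate = estimated_num_clusters(k, n)
        print(f"k={k}, n={n}: estimate {estimate} vs default {default_initial}")
```

Even for modest k, the estimate is several times the default of 20, so it may make sense to derive the initial value from k and n when both are known.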
Suneel: I tried running the algorithm in both sequential mode and mapreduce
mode on the Reuters data last night, but both gave me runtime errors. The
reason may be that my laptop has hadoop-2.2.0, which may not be compatible with
mahout at this point.
> Creating a new page for StreamingKMeans documentation on mahout website
> -----------------------------------------------------------------------
>
> Key: MAHOUT-1468
> URL: https://issues.apache.org/jira/browse/MAHOUT-1468
> Project: Mahout
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.0
> Reporter: Pavan Kumar N
> Assignee: Andrew Musselman
> Labels: Documentation
> Fix For: 1.0
>
> Attachments: StreamingKMeans.txt
>
>
> Separate page required on Streaming K Means algorithm description and
> overview, explaining the various parameters that can be used in
> streamingkmeans, the strategy for parallelization, and a link to this paper:
> http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf