[ https://issues.apache.org/jira/browse/MAHOUT-5?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Eastman updated MAHOUT-5:
------------------------------
Attachment: MAHOUT-5a.diff
A working implementation of a K-Means Clustering algorithm. See the unit tests
for the evolution of the user stories leading to the full implementation. This
implementation expects an input set of clusters (perhaps produced by canopy
clustering) and iterates them to convergence before a final pass that clusters
the input points.
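The flow described above (seed clusters in, iterate to convergence, then one
final assignment pass) can be sketched in memory as follows. This is only an
illustrative sketch, not the code in MAHOUT-5a.diff: the class name
KMeansSketch, its method names, the use of double[] points, the squared
Euclidean distance, and the "every center moved at most delta" convergence
test are all assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical in-memory sketch of the k-means flow; not the patch's code. */
final class KMeansSketch {

  /** Squared Euclidean distance between two points of equal length. */
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Index of the center nearest to the given point. */
  static int nearest(double[] point, double[][] centers) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centers.length; c++) {
      double d = squaredDistance(point, centers[c]);
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  /**
   * One iteration: assign every point to its nearest center, then recompute
   * each center as the mean of its points. Returns true when no center moved
   * farther than delta (an assumed convergence criterion).
   */
  static boolean iterate(double[][] points, double[][] centers, double delta) {
    int dims = centers[0].length;
    double[][] totals = new double[centers.length][dims];
    int[] counts = new int[centers.length];
    for (double[] p : points) {
      int c = nearest(p, centers);
      counts[c]++;
      for (int i = 0; i < dims; i++) {
        totals[c][i] += p[i];
      }
    }
    boolean converged = true;
    for (int c = 0; c < centers.length; c++) {
      if (counts[c] == 0) {
        continue; // an empty cluster keeps its previous center
      }
      double[] updated = new double[dims];
      for (int i = 0; i < dims; i++) {
        updated[i] = totals[c][i] / counts[c];
      }
      if (Math.sqrt(squaredDistance(updated, centers[c])) > delta) {
        converged = false;
      }
      centers[c] = updated;
    }
    return converged;
  }

  /** Iterates until convergence or the iteration limit, then assigns points. */
  static int[] cluster(double[][] points, double[][] seeds, double delta, int maxIterations) {
    double[][] centers = new double[seeds.length][];
    for (int c = 0; c < seeds.length; c++) {
      centers[c] = seeds[c].clone(); // leave the caller's seed clusters untouched
    }
    for (int iter = 0; iter < maxIterations; iter++) {
      if (iterate(points, centers, delta)) {
        break;
      }
    }
    int[] assignment = new int[points.length];
    for (int i = 0; i < points.length; i++) {
      assignment[i] = nearest(points[i], centers); // the final clustering pass
    }
    return assignment;
  }

  public static void main(String[] args) {
    double[][] points = {{1, 1}, {2, 1}, {1, 2}, {8, 8}, {9, 8}, {8, 9}};
    double[][] seeds = {{1, 1}, {8, 8}};
    System.out.println(Arrays.toString(cluster(points, seeds, 0.001, 10)));
  }
}
```

The seed centers play the role of the input clusters mentioned above; in the
sketch they are passed in directly rather than produced by a canopy step.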
This patch also refactors some of the Canopy Clustering code, because the two
share many common elements; the Canopy code is therefore included in its
entirety. See MAHOUT-3 for the Canopy commit log details. I will tease the two
apart if needed.
TODO: Sort out the generics
TODO: Allow points to be sparse, ...
All unit tests run.
- src/main/java/org/apache/mahout/clustering/kmeans
  - Cluster.java
    (formatCluster, decodeCluster): cluster i/o formatting
    (configure, config): configuration for jobs or unit tests
    (emitPointToNearestCluster): the essence of kmeans, does what it says
    (addPoint, addPoints): add one or more points to the cluster
    (recomputeCenter, computeConvergence, isConverged): useful helpers
    (toString, getCenter, getNumPoints, getPointTotal): accessors
  - KMeansMapper.java
    (configure, config): load the clusters for jobs or unit tests
    (map): emits all points to nearest cluster
  - KMeansCombiner.java
    (configure): configuration for jobs
    (reduce): computes partial totalCount values for points seen
      by associated mapper. Outputs canopy key and numPoints, pointTotal to
      reducer for final combination
  - KMeansReducer.java
    (configure): configuration for jobs
    (reduce): computes new canopy centroid and determines convergence
  - KMeansDriver.java
    (runIteration): runs an iteration M/R job to map/combine/reduce clusters
      and return convergence criteria
    (runClustering): final map step assigns original points to converged
      clusters. No reducer or combiner for this step.
    (runJob): runs one or more iterations until convergence is achieved or
      iteration limit is exceeded
- src/main/java/org/apache/mahout/clustering/utils
  - DistanceMeasure.java: old friend from Canopy clustering
  - EuclideanDistanceMeasure.java: old friend from Canopy clustering
  - ManhattanDistanceMeasure.java: old friend from Canopy clustering
  - Point.java: refactors useful operations on Float[] points
- src/test/java/org/apache/mahout/clustering/kmeans
  - TestKmeansClustering.java
    (referenceKmeans, iterateReference): reference implementation (thanks for
      the code to look at Dawid)
    (testReferenceImplementation, testKMeansMapper, testKMeansCombiner,
      testKMeansReducer): test isolated components
    (testKMeansMRJob): tests the job on all values of k for test points
  - VisibleCluster.java
    (addPoint, toString): overrides inherited methods for testing
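To illustrate how the combiner and reducer entries above fit together, here is
a rough plain-Java sketch with no Hadoop dependency. It is not the code in the
patch: CombineReduceSketch, Distance, Partial, and the method signatures are
hypothetical names chosen for the example, and the convergence test (old and
new center within delta of each other) is an assumption. Only the
numPoints/pointTotal pairing and the two distance measures come from the list
above.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the combine/reduce arithmetic; not the patch's code. */
final class CombineReduceSketch {

  /** Stand-in for a pluggable distance measure. */
  interface Distance {
    double distance(double[] a, double[] b);
  }

  static final Distance EUCLIDEAN = (a, b) -> {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return Math.sqrt(sum);
  };

  static final Distance MANHATTAN = (a, b) -> {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);
    }
    return sum;
  };

  /** Partial result for one cluster: how many points, and their vector sum. */
  static final class Partial {
    final int numPoints;
    final double[] pointTotal;

    Partial(int numPoints, double[] pointTotal) {
      this.numPoints = numPoints;
      this.pointTotal = pointTotal;
    }
  }

  /** Combiner step: fold the points one mapper emitted for a cluster into a partial. */
  static Partial combine(List<double[]> points, int dims) {
    double[] total = new double[dims];
    for (double[] p : points) {
      for (int i = 0; i < dims; i++) {
        total[i] += p[i];
      }
    }
    return new Partial(points.size(), total);
  }

  /** Reducer step: merge all partials for a cluster and compute the new centroid. */
  static double[] reduce(List<Partial> partials, int dims) {
    int n = 0;
    double[] total = new double[dims];
    for (Partial p : partials) {
      n += p.numPoints;
      for (int i = 0; i < dims; i++) {
        total[i] += p.pointTotal[i];
      }
    }
    if (n == 0) {
      throw new IllegalArgumentException("no points for this cluster");
    }
    double[] center = new double[dims];
    for (int i = 0; i < dims; i++) {
      center[i] = total[i] / n;
    }
    return center;
  }

  /** Assumed convergence test: the center moved at most delta under the measure. */
  static boolean isConverged(double[] oldCenter, double[] newCenter, Distance measure, double delta) {
    return measure.distance(oldCenter, newCenter) <= delta;
  }

  public static void main(String[] args) {
    Partial fromMapperA = combine(Arrays.asList(new double[] {1, 1}, new double[] {3, 1}), 2);
    Partial fromMapperB = combine(Arrays.asList(new double[] {2, 4}), 2);
    double[] center = reduce(Arrays.asList(fromMapperA, fromMapperB), 2);
    System.out.println(Arrays.toString(center));
  }
}
```

Shipping a count and a vector sum rather than a per-mapper mean is what lets
the reducer merge partials exactly: sums and counts add, while averages of
averages would weight mappers incorrectly when they saw different numbers of
points.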
> Implement a k-means clustering prototype
> -----------------------------------------
>
> Key: MAHOUT-5
> URL: https://issues.apache.org/jira/browse/MAHOUT-5
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.1
> Reporter: Jeff Eastman
> Assignee: Jeff Eastman
> Priority: Minor
> Attachments: kmeans.zip, MAHOUT-5a.diff
>
>
> K-means clustering is closely related to Canopy clustering and often uses
> canopies to determine the initial clusters. I'd like to implement a k-means
> prototype and tests in the package org.apache.mahout.clustering.kmeans.