-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10193/
-----------------------------------------------------------
(Updated March 29, 2013, 1:42 p.m.)


Review request for mahout, Ted Dunning and Sebastian Schelter.


Changes
-------

Mentioned the review request containing the BallKMeans and StreamingKMeans classes.


Description (updated)
-------

This depends (loosely) on https://reviews.apache.org/r/10194/

This patch implements the MapReduce version of StreamingKMeans for MAHOUT-1154.

It adds 5 new classes:
- CentroidWritable: class representing a centroid that can be written to a SeqFile
- StreamingKMeansDriver: class extending AbstractJob that is the entry point to the MapReduce job
- StreamingKMeansMapper: mapper, running StreamingKMeans (see MAHOUT-1162), clustering the points one by one
- StreamingKMeansReducer: reducer, running BallKMeans (see MAHOUT-1162) a number of times and picking the clustering with the lowest total clustering cost. The cost is determined by randomly splitting the incoming centroids into a "training" and a "test" set, computing the centroids on the training set and the cost on the test set. The intent is to see whether the centroids actually describe the distribution of the points or not.
- StreamingKMeansUtilsMR: helper class with a method to instantiate a searcher from a Configuration.

Additionally, there is a test class, StreamingKMeansTestMR, that uses MRUnit to test the mapper, the reducer, and the two together.

!!! The tests require MRUnit, so the core pom.xml file needs to declare it as a dependency. We depend on the 1.0 snapshot, which is not yet released (it will be very soon), hence the updated pom.xml is not provided for now. !!!

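The reducer's selection step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the patch: the class HeldOutCostSketch and all of its method names are hypothetical, points are plain double[] arrays rather than Mahout vectors, the cost is taken to be the sum of squared Euclidean distances to the nearest centroid, the training/test split is done once and shared by all runs, and a few Lloyd iterations stand in for BallKMeans.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: pick the clustering with the lowest cost on a held-out test set.
final class HeldOutCostSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Sum over all points of the squared distance to the nearest centroid. */
    static double totalCost(List<double[]> points, List<double[]> centroids) {
        double cost = 0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] c : centroids) {
                best = Math.min(best, squaredDistance(p, c));
            }
            cost += best;
        }
        return cost;
    }

    /** Stand-in for BallKMeans: 10 Lloyd iterations from k random seeds (needs train.size() >= k). */
    static List<double[]> cluster(List<double[]> train, int k, Random rng) {
        List<double[]> shuffled = new ArrayList<>(train);
        Collections.shuffle(shuffled, rng);
        List<double[]> centroids = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            centroids.add(shuffled.get(i).clone()); // clone so input points are never modified
        }
        int dim = train.get(0).length;
        for (int iter = 0; iter < 10; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : train) {
                int nearest = 0;
                for (int j = 1; j < k; j++) {
                    if (squaredDistance(p, centroids.get(j))
                            < squaredDistance(p, centroids.get(nearest))) {
                        nearest = j;
                    }
                }
                counts[nearest]++;
                for (int d = 0; d < dim; d++) {
                    sums[nearest][d] += p[d];
                }
            }
            for (int j = 0; j < k; j++) {
                if (counts[j] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids.get(j)[d] = sums[j][d] / counts[j];
                }
            }
        }
        return centroids;
    }

    /** Splits once into test/training sets, clusters the training set numRuns times,
     *  and returns the centroids with the lowest total cost on the test set. */
    static List<double[]> bestOfRuns(List<double[]> points, int k, int numRuns,
                                     double testFraction, long seed) {
        Random rng = new Random(seed);
        List<double[]> shuffled = new ArrayList<>(points);
        Collections.shuffle(shuffled, rng);
        int testSize = (int) (shuffled.size() * testFraction);
        List<double[]> test = shuffled.subList(0, testSize);
        List<double[]> train = shuffled.subList(testSize, shuffled.size());

        List<double[]> best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int run = 0; run < numRuns; run++) {
            List<double[]> centroids = cluster(train, k, rng);
            double cost = totalCost(test, centroids);
            if (cost < bestCost) {
                bestCost = cost;
                best = centroids;
            }
        }
        return best;
    }
}
```

A call such as HeldOutCostSketch.bestOfRuns(points, k, 4, 0.25, 42L) would keep 25% of the incoming points as the test set and return the best of 4 runs. Scoring on points the clusterer never saw is what lets the cost indicate whether the centroids describe the overall distribution rather than just the training points.
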
Diffs
-----

  core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/CentroidWritable.java PRE-CREATION
  core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java PRE-CREATION
  core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java PRE-CREATION
  core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansReducer.java PRE-CREATION
  core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansUtilsMR.java PRE-CREATION
  core/src/test/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansTestMR.java PRE-CREATION
  src/conf/driver.classes.default.props ac45eef

Diff: https://reviews.apache.org/r/10193/diff/


Testing
-------

See StreamingKMeansTestMR for the tests. These are all performed on data sampled from a "hypercube" distribution (there is a multinormal distribution at each vertex of the cube).

Additionally, there are ongoing tests on the 20 newsgroups data set (and some more are on the way).


Thanks,

Dan Filimon