[Part 2]

What still needs to be done for this version is:

- from a code perspective:
  - figure out decent SearchQualityTests
  - add configs to driver.classes.default.props, add a skmeans.props file
- from a quality perspective:
  - compare the resulting cluster quality on 20 newsgroups (this is the
    dataset I talked with Ted about using); there is a preliminary version
    of the results that is very encouraging, but it uses the entire dataset
    as training. I'm actually not 100% sure how evaluating on the test
    set would work, but that's for a separate thread.
  - compare the runtime on a cluster (there are some preliminary results,
    but they're on just one machine)

The code itself (from what you've seen of Ted's initial work) is
essentially the same (with the bugs fixed :). The completely new classes
are the ones doing the MapReduce and most of the tests and tools.

Where do we go from here?
Do I open JIRA issues for the changes?
Do I first merge changes to the existing Mahout classes?

Thoughts? Any changes?

All the best,
Dan

On Mon, Mar 4, 2013 at 11:08 AM, Dan Filimon <[email protected]> wrote:
> [Part 1]
>
> Hello everyone,
>
> We talked about getting the streaming k-means code ready for
> integration at least in a first attempt by 0.8-RC.
>
> Let me first walk you through the additions.
> The main classes are in o.a.m.clustering.streaming [1], under the
> core/ project. These are subdivided into 3 packages:
>
> - cluster: contains the BallKMeans and StreamingKMeans classes that
> can be used standalone.
> BallKMeans is exactly what it sounds like (uses k-means++ for the
> initialization, then does a normal k-means pass while ignoring
> outliers).
> StreamingKMeans implements the online clustering that doesn't return
> exactly k clusters (it returns an estimate). This is used to
> approximate the data.
>
> - mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
> StreamingKMeansMapper and StreamingKMeansReducer classes.
> CentroidWritable serializes Centroids (sort of like AbstractCluster).
> StreamingKMeansDriver provides the driver for the job.
> StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
> sketches of the data for the reducer.
> StreamingKMeansReducer collects the centroids produced by the
> mappers into one set of weighted points and runs BallKMeans on them
> producing the final results.
>
> - search: various searcher classes that implement nearest-neighbor
> search using different strategies.
> Searcher, UpdatableSearcher: abstract classes that define how to
> search through collections of vectors.
> BruteSearch: does a brute search (looks at every point...)
> ProjectionSearch: uses random projections for searching.
> FastProjectionSearch: also uses random projections (but not binary
> search trees as in ProjectionSearch).
> HashedVector, LocalitySensitiveHashSearch: implement locality
> sensitive hash search.
>
> All the tools that I used are in o.a.m.clustering.streaming [2], under
> the examples/ project.
> There are a bunch of classes here, covering everything from
> vectorizing 20 newsgroups data to various IO utils. The more important
> ones are:
> utils.ExperimentUtils: convenience methods.
> tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
>
> [1]
> https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
> [2]
> https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
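
To make the "online clustering that doesn't return exactly k clusters" idea a bit more concrete, here is a minimal sketch of the general streaming scheme. It is not the code from the skm branch: the class name StreamingSketch, the maxCentroids and initialCutoff parameters, and the 1.5 growth factor are all assumptions made for illustration. Each incoming point either merges into its nearest centroid or, with probability proportional to its distance, starts a new one; when there are too many centroids, the distance cutoff grows and the centroids themselves are re-clustered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not the StreamingKMeans class from the skm branch.
final class StreamingSketch {
    // A centroid plus the total weight of the points merged into it.
    static final class WeightedCentroid {
        final double[] center;
        double weight;

        WeightedCentroid(double[] center, double weight) {
            this.center = center.clone();
            this.weight = weight;
        }

        // Move the center toward the new point, proportionally to its weight.
        void merge(double[] point, double pointWeight) {
            double total = weight + pointWeight;
            for (int i = 0; i < center.length; i++) {
                center[i] += (point[i] - center[i]) * pointWeight / total;
            }
            weight = total;
        }
    }

    private final List<WeightedCentroid> centroids = new ArrayList<>();
    private final int maxCentroids; // assumed soft limit on the sketch size
    private final Random random;
    private double cutoff;          // assumed distance scale, grows over time

    StreamingSketch(int maxCentroids, double initialCutoff, long seed) {
        if (maxCentroids < 1 || !(initialCutoff > 0)) {
            throw new IllegalArgumentException("need maxCentroids >= 1 and initialCutoff > 0");
        }
        this.maxCentroids = maxCentroids;
        this.cutoff = initialCutoff;
        this.random = new Random(seed);
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Add one point; collapse the sketch whenever it grows past the limit.
    void add(double[] point, double weight) {
        insert(point, weight);
        while (centroids.size() > maxCentroids) {
            cutoff *= 1.5; // assumed growth factor
            List<WeightedCentroid> old = new ArrayList<>(centroids);
            centroids.clear();
            for (WeightedCentroid c : old) {
                insert(c.center, c.weight);
            }
        }
    }

    private void insert(double[] point, double weight) {
        WeightedCentroid nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (WeightedCentroid c : centroids) {
            double d = distance(c.center, point);
            if (d < best) {
                best = d;
                nearest = c;
            }
        }
        // Far-away points are likely to start a new centroid; close ones merge.
        if (nearest == null || random.nextDouble() < best / cutoff) {
            centroids.add(new WeightedCentroid(point, weight));
        } else {
            nearest.merge(point, weight);
        }
    }

    int size() {
        return centroids.size();
    }

    double totalWeight() {
        double sum = 0;
        for (WeightedCentroid c : centroids) {
            sum += c.weight;
        }
        return sum;
    }
}
```

The weighted centroids left at the end are the kind of "sketch" a mapper could hand to a reducer, which would then run a regular k-means variant on them. Total weight is preserved, so the final pass still sees how many points each centroid stands for.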

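
The difference between a brute search and a projection-based search can be sketched as follows. This is only an illustration under stated assumptions, not the Searcher API from the skm branch: the class SearchSketch, its method names, and the window parameter are hypothetical, and it uses a single random projection kept in a sorted array rather than the binary search trees mentioned for ProjectionSearch.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Hypothetical illustration only; not the Searcher classes from the skm branch.
final class SearchSketch {
    private final double[][] data;
    private final double[] direction;         // one random projection direction
    private final Integer[] order;            // data indices sorted by projection
    private final double[] sortedProjections; // projections in the same order

    SearchSketch(double[][] data, long seed) {
        this.data = data;
        Random random = new Random(seed);
        direction = new double[data[0].length];
        for (int i = 0; i < direction.length; i++) {
            direction[i] = random.nextGaussian();
        }
        final double[] proj = new double[data.length];
        order = new Integer[data.length];
        for (int i = 0; i < data.length; i++) {
            proj[i] = dot(data[i], direction);
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> proj[i]));
        sortedProjections = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            sortedProjections[i] = proj[order[i]];
        }
    }

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }

    // Brute search: compare the query against every point.
    static int bruteNearest(double[][] data, double[] query) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < data.length; i++) {
            double d = squaredDistance(data[i], query);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    // Projection search: only check points whose projection is near the query's.
    // The answer is approximate unless the window covers the whole data set.
    int approximateNearest(double[] query, int window) {
        int pos = Arrays.binarySearch(sortedProjections, dot(query, direction));
        if (pos < 0) {
            pos = -pos - 1; // binarySearch returns -(insertion point) - 1 on a miss
        }
        int from = Math.max(0, pos - window);
        int to = Math.min(data.length, pos + window);
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int k = from; k < to; k++) {
            double d = squaredDistance(data[order[k]], query);
            if (d < bestDistance) {
                bestDistance = d;
                best = order[k];
            }
        }
        return best;
    }
}
```

A small window trades accuracy for speed: fewer true distances are computed, but the real nearest neighbor may fall outside the candidates. Using several independent projections, as a real implementation likely would, reduces that risk.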