On Mar 4, 2013, at 4:19 AM, Dan Filimon wrote:

> [Part 2]
>
> What still needs to be done for this version is:
> - from a code perspective:
>   - figure out decent SearchQualityTests
>   - add configs to driver.classes.default.props, add a skmeans.props file
>
> - from a quality perspective:
>   - compare the resulting cluster quality on 20 newsgroups (this is
>     the dataset I talked with Ted about using); there is a preliminary
>     version of the results that is very encouraging, but it uses the entire
>     dataset as training. I'm actually not 100% sure how evaluating on
>     the test set would work, but that's for a separate thread.
>   - compare the runtime on a cluster (there are some preliminary
>     results but they're on just one machine)
>
> The code itself (from what you've seen of Ted's initial work) is
> essentially the same (with the bugs fixed :). The completely new
> classes are the ones doing the MapReduce and most of the tests and
> tools.
>
> Where do we go from here? Do I open JIRA issues for the changes? Do I
> first merge changes to the existing Mahout classes?

I believe there is a JIRA already open for it (if not, open one). A patch that can be applied to trunk/master with all tests passing would be best. Any patch that more or less shows what is done is also welcome, although it is a bit harder to consume.

> Thoughts? Any changes?
>
> All the best,
> Dan
>
> On Mon, Mar 4, 2013 at 11:08 AM, Dan Filimon
> <[email protected]> wrote:
>> [Part 1]
>>
>> Hello everyone,
>>
>> We talked about getting the streaming k-means code ready for
>> integration, at least in a first attempt, by 0.8-RC.
>>
>> Let me first walk you through the additions.
>> The main classes are in o.a.m.clustering.streaming [1], under the
>> core/ project. These are subdivided into 3 packages:
>>
>> - cluster: contains the BallKMeans and StreamingKMeans classes that
>> can be used standalone.
>> BallKMeans is exactly what it sounds like (uses k-means++ for the
>> initialization, then does a normal k-means pass, ignoring
>> outliers).
>> StreamingKMeans implements the online clustering that doesn't return
>> exactly k clusters (it returns an estimate). This is used to
>> approximate the data.
>>
>> - mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
>> StreamingKMeansMapper and StreamingKMeansReducer classes.
>> CentroidWritable serializes Centroids (sort of like AbstractCluster).
>> StreamingKMeansDriver provides the driver for the job.
>> StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
>> sketches of the data for the reducer.
>> StreamingKMeansReducer collects the centroids produced by the
>> mappers into one set of weighted points and runs BallKMeans on them,
>> producing the final results.
>>
>> - search: various searcher classes that implement nearest-neighbor
>> search using different strategies.
>> Searcher, UpdatableSearcher: abstract classes that define how to
>> search through collections of vectors.
>> BruteSearch: does a brute search (looks at every point...)
>> ProjectionSearch: uses random projections for searching.
>> FastProjectionSearch: also uses random projections (but not binary
>> search trees as in ProjectionSearch).
>> HashedVector, LocalitySensitiveHashSearch: implement locality
>> sensitive hash search.
>>
>> All the tools that I used are in o.a.m.clustering.streaming [2], under
>> the examples/ project.
>> There are a bunch of classes here, covering everything from
>> vectorizing 20 newsgroups data to various IO utils. The more important
>> ones are:
>> utils.ExperimentUtils: convenience methods.
>> tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
>>
>> [1]
>> https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
>> [2]
>> https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com
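
For anyone following the thread who hasn't looked at the branch yet, here is a rough, self-contained sketch of the kind of online clustering described above for StreamingKMeans, where the number of centroids is an estimate rather than exactly k. It is not the code in the skm branch. The class name StreamingSketch, the fixed distance cutoff, and the seed parameter are all made up for illustration, and the nearest-centroid lookup is a plain brute-force scan in the spirit of BruteSearch. A real implementation would presumably also grow the cutoff and collapse centroids when there are too many, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of online clustering; not Mahout's StreamingKMeans. */
class StreamingSketch {
    /** A weighted centroid: a running mean plus the number of points merged into it. */
    static final class Centroid {
        final double[] mean;
        double weight;

        Centroid(double[] point) {
            this.mean = point.clone();
            this.weight = 1.0;
        }

        /** Folds one more point into the running mean. */
        void merge(double[] point) {
            weight += 1.0;
            for (int i = 0; i < mean.length; i++) {
                mean[i] += (point[i] - mean[i]) / weight;
            }
        }
    }

    private final List<Centroid> centroids = new ArrayList<>();
    private final double cutoff;  // assumed fixed here; larger values give fewer centroids
    private final Random random;

    StreamingSketch(double cutoff, long seed) {
        this.cutoff = cutoff;
        this.random = new Random(seed);
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Processes one point: either starts a new centroid or merges into the nearest one. */
    void add(double[] point) {
        if (centroids.isEmpty()) {
            centroids.add(new Centroid(point));
            return;
        }
        // Brute-force nearest-centroid search (looks at every centroid).
        Centroid nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (Centroid c : centroids) {
            double d = squaredDistance(c.mean, point);
            if (d < best) {
                best = d;
                nearest = c;
            }
        }
        // Far-away points are more likely to seed a new centroid.
        if (random.nextDouble() < best / cutoff) {
            centroids.add(new Centroid(point));
        } else {
            nearest.merge(point);
        }
    }

    List<Centroid> centroids() {
        return centroids;
    }

    public static void main(String[] args) {
        StreamingSketch sketch = new StreamingSketch(50.0, 42L);
        Random data = new Random(7L);
        for (int i = 0; i < 1000; i++) {
            // Two synthetic blobs, around (0, 0) and (10, 10).
            double offset = (i % 2 == 0) ? 0.0 : 10.0;
            sketch.add(new double[] {offset + data.nextGaussian(), offset + data.nextGaussian()});
        }
        System.out.println("centroids in sketch: " + sketch.centroids().size());
    }
}
```

The weighted centroids that come out of a pass like this are what a final step in the style of BallKMeans would then cluster down to k.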
