On Mar 4, 2013, at 4:19 AM, Dan Filimon wrote:

> [Part 2]
> 
> What still needs to be done for this version is:
> - from a code perspective:
>  - figure out decent SearchQualityTests
>  - add configs to driver.classes.default.props, add a skmeans.props file
> 
> - from a quality perspective:
>  - compare the resulting cluster quality on 20 newsgroups (this is
> the dataset I talked with Ted about using); there is a preliminary
> version of the results that is very encouraging but it uses the entire
> dataset as training. I'm actually not 100% sure how evaluating on
> the test set would work, but that's for a separate thread.
>  - compare the runtime on a cluster (there are some preliminary
> results but they're on just one machine)
> 
> The code itself is essentially the same as what you've seen in Ted's
> initial work (with the bugs fixed :). The completely new classes are
> the ones doing the MapReduce and most of the tests and tools.
> 
> Where do we go from here? Do I open JIRA issues for the changes? Do I
> first merge changes to the existing Mahout classes?

I believe there is a JIRA already open for it (if not, open one).  A patch that 
can be applied to trunk/master with all tests passing would be best.  Any patch 
that more or less shows what is done is also welcome, although it is a bit 
harder to consume.

> Thoughts? Any changes?
> 
> All the best,
> Dan
> 
> On Mon, Mar 4, 2013 at 11:08 AM, Dan Filimon
> <[email protected]> wrote:
>> [Part 1]
>> 
>> Hello everyone,
>> 
>> We talked about getting the streaming k-means code ready for
>> integration at least in a first attempt by 0.8-RC.
>> 
>> Let me first walk you through the additions.
>> The main classes are in o.a.m.clustering.streaming [1], under the
>> core/ project. These are subdivided into 3 packages:
>> 
>> - cluster: contains the BallKMeans and StreamingKMeans classes that
>> can be used standalone.
>>  BallKMeans is exactly what it sounds like (uses k-means++ for the
>> initialization, then does a normal k-means pass, ignoring
>> outliers).
>>  StreamingKMeans implements online clustering that doesn't return
>> exactly k clusters (it returns an estimate). This is used to
>> approximate the data.
>> 
>> - mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
>> StreamingKMeansMapper and StreamingKMeansReducer classes.
>>  CentroidWritable serializes Centroids (sort of like AbstractCluster).
>>  StreamingKMeansDriver provides the driver for the job.
>>  StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
>> sketches of the data for the reducer.
>>  StreamingKMeansReducer collects the centroids produced by the
>> mappers into one set of weighted points and runs BallKMeans on them,
>> producing the final results.
>> 
>> - search: various searcher classes that implement nearest-neighbor
>> search using different strategies.
>>  Searcher, UpdatableSearcher: abstract classes that define how to
>> search through collections of vectors.
>>  BruteSearch: does a brute-force search (looks at every point...)
>>  ProjectionSearch: uses random projections for searching.
>>  FastProjectionSearch: also uses random projections (but not binary
>> search trees as in ProjectionSearch).
>>  HashedVector, LocalitySensitiveHashSearch: implement locality
>> sensitive hash search.
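
For readers unfamiliar with the search package, here is a minimal sketch of what "looks at every point" means for a brute-force searcher. It is not Mahout's BruteSearch: the class name BruteSearchSketch, the use of plain double[] arrays instead of Mahout's Vector, and the squared Euclidean distance are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Mahout BruteSearch class.
final class BruteSearchSketch {
    private final List<double[]> points = new ArrayList<>();

    // Store a defensive copy so later changes by the caller do not affect the index.
    void add(double[] v) {
        points.add(v.clone());
    }

    // Squared Euclidean distance (an assumed metric for this sketch).
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Linear scan over every stored point; returns the index of the
    // closest one, or -1 if nothing has been added yet.
    int nearest(double[] query) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.size(); i++) {
            double d = squaredDistance(points.get(i), query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```

Each query costs time proportional to the number of stored points, which is presumably the cost the projection-based and hash-based searchers are trying to avoid.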
>> 
>> All the tools that I used are in o.a.m.clustering.streaming [2], under
>> the examples/ project.
>> There are a bunch of classes here, covering everything from
>> vectorizing 20 newsgroups data to various IO utils. The more important
>> ones are:
>>  utils.ExperimentUtils: convenience methods.
>>  tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
>> 
>> [1] 
>> https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
>> [2] 
>> https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
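
To make the StreamingKMeans step from Part 1 a bit more concrete, here is a rough sketch of one common form of online clustering that produces a variable number of weighted centroids. It is an assumption about the general technique, not a copy of the code in the skm branch: the class StreamingSketch, the fixed distance cutoff, and the probability rule (squared distance divided by cutoff) are all hypothetical, and the sketch omits the step where the cutoff grows once there are too many centroids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not the Mahout StreamingKMeans class.
final class StreamingSketch {
    final List<double[]> centers = new ArrayList<>();
    final List<Double> weights = new ArrayList<>();
    private final double cutoff;   // assumed fixed; a real implementation would adapt it
    private final Random random;

    StreamingSketch(double cutoff, long seed) {
        this.cutoff = cutoff;
        this.random = new Random(seed);
    }

    // Process one point: either start a new centroid or fold the point
    // into its nearest existing centroid.
    void add(double[] p) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double[] c = centers.get(i);
            double d = 0.0;
            for (int j = 0; j < p.length; j++) {
                double diff = c[j] - p[j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        // Far-away points are likely to become new centroids.
        if (best < 0 || random.nextDouble() < bestDist / cutoff) {
            centers.add(p.clone());
            weights.add(1.0);
        } else {
            // Weighted mean update keeps the centroid at the mean of its points.
            double[] c = centers.get(best);
            double w = weights.get(best);
            for (int j = 0; j < p.length; j++) {
                c[j] = (c[j] * w + p[j]) / (w + 1.0);
            }
            weights.set(best, w + 1.0);
        }
    }

    // Total weight always equals the number of points seen so far.
    double totalWeight() {
        double sum = 0.0;
        for (double w : weights) {
            sum += w;
        }
        return sum;
    }
}
```

In the MapReduce layout described above, each mapper would hold one such sketch, and the reducer would gather the weighted centroids from all mappers and run a final clustering pass over them.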

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com




