[Part 2]

What still needs to be done for this version is:
- from a code perspective:
  - figure out decent SearchQualityTests
  - add configs to driver.classes.default.props, add a skmeans.props file

- from a quality perspective:
  - compare the resulting cluster quality on 20 newsgroups (this is
the dataset I talked with Ted about using); there is a preliminary
version of the results that is very encouraging but it uses the entire
dataset as training. I'm actually not 100% sure how evaluating on
the test set would work, but that's for a separate thread.
  - compare the runtime on a cluster (there are some preliminary
results but they're on just one machine)

The code itself (if you've seen Ted's initial work) is
essentially the same (with the bugs fixed :). The completely new
classes are the ones doing the MapReduce and most of the tests and
tools.

Where do we go from here? Do I open JIRA issues for the changes? Do I
first merge changes to the existing Mahout classes?
Thoughts? Any changes?

All the best,
Dan

On Mon, Mar 4, 2013 at 11:08 AM, Dan Filimon
<[email protected]> wrote:
> [Part 1]
>
> Hello everyone,
>
> We talked about getting the streaming k-means code ready for
> integration at least in a first attempt by 0.8-RC.
>
> Let me first walk you through the additions.
> The main classes are in o.a.m.clustering.streaming [1], under the
> core/ project. These are subdivided into 3 packages:
>
> - cluster: contains the BallKMeans and StreamingKMeans classes that
> can be used standalone.
>   BallKMeans is exactly what it sounds like (uses k-means++ for the
> initialization, then does a normal k-means pass while ignoring
> outliers).
>   StreamingKMeans implements the online clustering, which doesn't return
> exactly k clusters (it returns an estimate). This is used to
> approximate the data.
>
> - mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
> StreamingKMeansMapper and StreamingKMeansReducer classes.
>   CentroidWritable serializes Centroids (sort of like AbstractCluster).
>   StreamingKMeansDriver provides the driver for the job.
>   StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
> sketches of the data for the reducer.
>   StreamingKMeansReducer collects the centroids produced by the
> mappers into one set of weighted points and runs BallKMeans on them
> producing the final results.
>
> - search: various searcher classes that implement nearest-neighbor
> search using different strategies.
>   Searcher, UpdatableSearcher: abstract classes that define how to
> search through collections of vectors.
>   BruteSearch: does a brute-force search (looks at every point...)
>   ProjectionSearch: uses random projections for searching.
>   FastProjectionSearch: also uses random projections (but not binary
> search trees as in ProjectionSearch).
>   HashedVector, LocalitySensitiveHashSearch: implement locality
> sensitive hash search.
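
To make the baseline in the search package concrete, here is a minimal
sketch of what a brute-force nearest-neighbor search does: compare the
query against every stored point and keep the closest. This is only an
illustration. The class name BruteForceSketch, its methods, and the use
of plain double[] vectors with squared Euclidean distance are all
assumptions of mine, not the Searcher/BruteSearch API in the branch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a brute-force searcher; NOT Mahout's BruteSearch API.
class BruteForceSketch {
    private final List<double[]> points = new ArrayList<>();

    // Stores a point; its index is its insertion order.
    void add(double[] p) {
        points.add(p);
    }

    // Squared Euclidean distance (assumed metric for this sketch).
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the index of the closest stored point, or -1 if nothing is stored.
    int nearest(double[] query) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.size(); i++) {
            double d = squaredDistance(points.get(i), query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```

Every query costs one distance computation per stored point, which is
the cost the projection-based and hash-based searchers try to reduce by
looking at fewer candidates.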
>
> All the tools that I used are in o.a.m.clustering.streaming [2], under
> the examples/ project.
> There are a bunch of classes here, covering everything from
> vectorizing 20 newsgroups data to various IO utils. The more important
> ones are:
>   utils.ExperimentUtils: convenience methods.
>   tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.
>
> [1] 
> https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
> [2] 
> https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming
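
For readers who want a feel for the "sketch" step that the mappers
perform, here is a rough single-machine illustration of a one-pass
streaming clustering. It is a sketch under stated assumptions, not the
code in the branch: the class StreamingSketch, the cutoff rule (start a
new centroid with probability proportional to squared distance over a
cutoff), the growth factor 1.5, and the parameters maxCentroids,
initialCutoff and seed are all hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative one-pass sketch; names and constants are hypothetical, NOT Mahout's API.
class StreamingSketch {
    static final class Centroid {
        final double[] mean;
        double weight;

        Centroid(double[] mean, double weight) {
            this.mean = mean.clone(); // copy so the input data is never mutated
            this.weight = weight;
        }

        // Folds a weighted point into this centroid (weighted mean update).
        void merge(double[] p, double w) {
            double total = weight + w;
            for (int i = 0; i < mean.length; i++) {
                mean[i] += (p[i] - mean[i]) * w / total;
            }
            weight = total;
        }
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Adds one weighted point: either opens a new centroid or merges into the nearest.
    static void add(List<Centroid> centroids, double[] p, double w, double cutoff, Random rnd) {
        Centroid nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (Centroid c : centroids) {
            double d = squaredDistance(c.mean, p);
            if (d < best) {
                best = d;
                nearest = c;
            }
        }
        // Assumed rule: points far from every centroid are more likely to start a new one.
        if (nearest == null || rnd.nextDouble() < w * best / cutoff) {
            centroids.add(new Centroid(p, w));
        } else {
            nearest.merge(p, w);
        }
    }

    // One pass over the data; the result has at most maxCentroids (>= 1) weighted centroids.
    static List<Centroid> sketch(List<double[]> data, int maxCentroids, double initialCutoff, long seed) {
        Random rnd = new Random(seed);
        double cutoff = initialCutoff;
        List<Centroid> centroids = new ArrayList<>();
        for (double[] p : data) {
            add(centroids, p, 1.0, cutoff, rnd);
            // Too many centroids: loosen the cutoff and re-cluster the centroids themselves.
            while (centroids.size() > maxCentroids) {
                cutoff *= 1.5;
                List<Centroid> collapsed = new ArrayList<>();
                for (Centroid c : centroids) {
                    add(collapsed, c.mean, c.weight, cutoff, rnd);
                }
                centroids = collapsed;
            }
        }
        return centroids;
    }
}
```

The number of centroids that comes out is not fixed in advance, only
bounded, and the weights always add up to the number of input points.
In the MapReduce flow described above, the reducer would take the
union of such weighted centroids from all mappers and run the final
k-means pass on them.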
