Dan Filimon created MAHOUT-1154:
-----------------------------------

             Summary: Implementing Streaming KMeans
                 Key: MAHOUT-1154
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1154
             Project: Mahout
          Issue Type: New Feature
          Components: Clustering
    Affects Versions: 0.8
            Reporter: Dan Filimon


An implementation of Streaming KMeans as mentioned in [1] is available here [2].

[1] http://mail-archives.apache.org/mod_mbox/mahout-dev/201303.mbox/%3ccaowb3goyf9zufrgxhsucpkjxk6cw0nnr8gwg__jsey+kvab...@mail.gmail.com%3E
[2] https://github.com/dfilimon/mahout

Since there will be more than one patch, a separate JIRA issue will
address each one.

The description of the code being added is:

The main classes are in o.a.m.clustering.streaming [3], under the
core/ project. These are subdivided into 3 packages:

- cluster: contains the BallKMeans and StreamingKMeans classes, which
can be used standalone.
  BallKMeans is exactly what it sounds like: it uses k-means++ for the
initialization, then does a normal k-means pass while ignoring
outliers.
  StreamingKMeans implements the online clustering step. It doesn't
return exactly k clusters (it returns an estimate); its output is used
to approximate the data.
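
To make the online step concrete, here is a minimal, self-contained
sketch of a streaming k-means pass. It is not the StreamingKMeans class
from the repository: the class and method names are hypothetical, and
the rule for opening a new centroid (probability proportional to the
squared distance divided by a cutoff) and the 1.5 cutoff growth factor
are assumptions taken from the general streaming k-means idea. It only
illustrates why the result is a set of weighted centroids whose count
is bounded but not fixed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative only: a hypothetical class, not Mahout's StreamingKMeans. */
final class StreamingSketch {

    /** A point or centroid plus the number of input points it stands for. */
    static final class WeightedPoint {
        final double[] x;
        double weight;

        WeightedPoint(double[] x, double weight) {
            this.x = x.clone();
            this.weight = weight;
        }
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Adds p as a new centroid or merges it into its nearest centroid. */
    static void insert(List<WeightedPoint> centroids, WeightedPoint p, double cutoff, Random rnd) {
        WeightedPoint nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (WeightedPoint c : centroids) {
            double d = squaredDistance(c.x, p.x);
            if (d < best) {
                best = d;
                nearest = c;
            }
        }
        // Assumed rule: far-away points are more likely to start a new centroid.
        if (nearest == null || rnd.nextDouble() < p.weight * best / cutoff) {
            centroids.add(new WeightedPoint(p.x, p.weight));
            return;
        }
        // Otherwise fold p into the nearest centroid (weighted mean).
        double total = nearest.weight + p.weight;
        for (int i = 0; i < nearest.x.length; i++) {
            nearest.x[i] += (p.x[i] - nearest.x[i]) * p.weight / total;
        }
        nearest.weight = total;
    }

    /** One pass over the data; returns at most maxCentroids weighted centroids. */
    static List<WeightedPoint> sketch(List<double[]> data, int maxCentroids,
                                      double initialCutoff, long seed) {
        if (maxCentroids < 1 || !(initialCutoff > 0)) {
            throw new IllegalArgumentException("need maxCentroids >= 1 and initialCutoff > 0");
        }
        Random rnd = new Random(seed);
        double cutoff = initialCutoff;
        List<WeightedPoint> centroids = new ArrayList<>();
        for (double[] x : data) {
            insert(centroids, new WeightedPoint(x, 1.0), cutoff, rnd);
            // Too many centroids: loosen the cutoff and re-cluster the centroids themselves.
            while (centroids.size() > maxCentroids) {
                cutoff *= 1.5; // growth factor is an arbitrary assumption
                List<WeightedPoint> collapsed = new ArrayList<>();
                for (WeightedPoint c : centroids) {
                    insert(collapsed, c, cutoff, rnd);
                }
                centroids = collapsed;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        List<double[]> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(new double[] {rnd.nextGaussian(), rnd.nextGaussian()});
        }
        List<WeightedPoint> result = sketch(data, 25, 0.01, 42L);
        System.out.println(result.size() + " centroids for " + data.size() + " points");
    }
}
```

The total weight of the returned centroids always equals the number of
input points, which is what lets a later k-means pass treat the sketch
as a stand-in for the full data set.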

- mapreduce: contains the CentroidWritable, StreamingKMeansDriver,
StreamingKMeansMapper and StreamingKMeansReducer classes.
  CentroidWritable serializes Centroids (sort of like AbstractCluster).
  StreamingKMeansDriver provides the driver for the job.
  StreamingKMeansMapper runs StreamingKMeans in the mappers to produce
sketches of the data for the reducer.
  StreamingKMeansReducer collects the centroids produced by the
mappers into one set of weighted points and runs BallKMeans on them,
producing the final results.
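
The reduce step described above can be sketched in plain Java, without
Hadoop. This is an illustration under stated assumptions, not the
StreamingKMeansReducer code: the names are hypothetical, the final
clustering uses ordinary weighted Lloyd iterations as a stand-in for
BallKMeans (so no k-means++ seeding and no outlier trimming), and the
first k weighted points are used as the initial centers.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not Mahout's StreamingKMeansReducer. */
final class ReduceSketch {

    /** A centroid emitted by a mapper, with the number of points it represents. */
    static final class Weighted {
        final double[] x;
        final double weight;

        Weighted(double[] x, double weight) {
            this.x = x;
            this.weight = weight;
        }
    }

    /** Index of the center closest to x (squared Euclidean distance). */
    static int nearest(double[][] centers, double[] x) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centers.length; j++) {
            double dist = 0;
            for (int d = 0; d < x.length; d++) {
                double diff = centers[j][d] - x[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = j;
            }
        }
        return best;
    }

    /** Pools all mapper sketches and runs weighted Lloyd iterations on them. */
    static double[][] reduce(List<List<Weighted>> mapperSketches, int k, int iterations) {
        List<Weighted> all = new ArrayList<>();
        for (List<Weighted> sketch : mapperSketches) {
            all.addAll(sketch);
        }
        if (k < 1 || all.size() < k) {
            throw new IllegalArgumentException("need at least k weighted points");
        }
        int dim = all.get(0).x.length;
        // Assumption: seed with the first k points (the original uses k-means++).
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            centers[j] = all.get(j).x.clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sum = new double[k][dim];
            double[] total = new double[k];
            for (Weighted p : all) {
                int j = nearest(centers, p.x);
                total[j] += p.weight;
                for (int d = 0; d < dim; d++) {
                    sum[j][d] += p.weight * p.x[d];
                }
            }
            for (int j = 0; j < k; j++) {
                if (total[j] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int d = 0; d < dim; d++) {
                    centers[j][d] = sum[j][d] / total[j];
                }
            }
        }
        return centers;
    }
}
```

Because every centroid carries a weight, a centroid that absorbed many
points in a mapper pulls the final center harder than one that absorbed
few.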

- search: various searcher classes that implement nearest-neighbor
search using different strategies.
  Searcher, UpdatableSearcher: abstract classes that define how to
search through collections of vectors.
  BruteSearch: does a brute-force search (looks at every point).
  ProjectionSearch: uses random projections for searching.
  FastProjectionSearch: also uses random projections (but not binary
search trees as in ProjectionSearch).
  HashedVector, LocalitySensitiveHashSearch: implement locality
sensitive hash search.
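
As a rough illustration of the difference between brute search and
projection-based search, here is a small standalone sketch. It does not
use the Searcher classes from the repository and its names are
hypothetical. It assumes a single random direction, a sorted array with
binary search (rather than whatever structure ProjectionSearch actually
uses), and a fixed window of candidates around the query's projection,
so the result is approximate unless the window covers all the data.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

/** Illustrative only: hypothetical names, not Mahout's searcher classes. */
final class SearchSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Brute search: compares the query against every vector. */
    static double[] bruteNearest(double[][] data, double[] query) {
        double[] best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (double[] v : data) {
            double d = squaredDistance(v, query);
            if (d < bestDist) {
                bestDist = d;
                best = v;
            }
        }
        return best;
    }

    /** Vectors sorted by their projection onto one random direction. */
    static final class ProjectionIndex {
        private final double[] direction;
        private final double[][] sorted;
        private final double[] keys;

        /** Assumes data is non-empty and all vectors have the same length. */
        ProjectionIndex(double[][] data, long seed) {
            Random rnd = new Random(seed);
            double[] dir = new double[data[0].length];
            for (int i = 0; i < dir.length; i++) {
                dir[i] = rnd.nextGaussian();
            }
            double[][] copy = data.clone();
            Arrays.sort(copy, Comparator.comparingDouble((double[] v) -> dot(dir, v)));
            double[] k = new double[copy.length];
            for (int i = 0; i < copy.length; i++) {
                k[i] = dot(dir, copy[i]);
            }
            this.direction = dir;
            this.sorted = copy;
            this.keys = k;
        }

        /** Checks true distances only for vectors whose projection is near the query's. */
        double[] nearest(double[] query, int window) {
            if (window < 1) {
                throw new IllegalArgumentException("window must be >= 1");
            }
            int w = Math.min(window, sorted.length);
            int pos = Arrays.binarySearch(keys, dot(direction, query));
            if (pos < 0) {
                pos = -pos - 1; // insertion point when there is no exact match
            }
            int lo = Math.max(0, pos - w);
            int hi = Math.min(sorted.length, pos + w);
            return bruteNearest(Arrays.copyOfRange(sorted, lo, hi), query);
        }
    }
}
```

A small window trades accuracy for speed: the candidate found is never
closer than the brute-search answer, and the two agree once the window
spans the whole data set.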

All the tools that I used are in o.a.m.clustering.streaming [4], under
the examples/ project.
There are a bunch of classes here, covering everything from
vectorizing 20 newsgroups data to various IO utils. The more important
ones are:
  utils.ExperimentUtils: convenience methods.
  tools.ClusterQuality20NewsGroups: actual experiment, with hardcoded paths.

[3] https://github.com/dfilimon/mahout/tree/skm/core/src/main/java/org/apache/mahout/clustering/streaming
[4] https://github.com/dfilimon/mahout/tree/skm/examples/src/main/java/org/apache/mahout/clustering/streaming

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
