[ 
https://issues.apache.org/jira/browse/MAHOUT-3?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Eastman updated MAHOUT-3:
------------------------------

    Attachment: MAHOUT-3c.diff

A working implementation of a Canopy Clustering algorithm. See unit tests for 
the evolution of the user stories leading to the full implementation.

This implementation incorporates Ted Dunning's comments on my original 
approach to canopy generation. In particular, it does not rely upon emitting 
data during the close() operation of the CanopyMapper or CanopyReducer.
During the map phase, subsets of the input points are assigned to canopies
by each mapper and output to a combiner which then computes and outputs the
canopy centroids for each subset. During the reduce phase, the centroids are
again clustered into a final set of canopies which are output. 
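
The three phases described above can be sketched in plain Java without Hadoop.
This is a minimal illustration, not the code in the patch: the class
CanopySketch, the double[] point representation, the fixed Manhattan distance
and the rule that t2 is the "strongly bound" threshold are all assumptions
made for the example. Each inner list stands for the subset of points seen by
one mapper.

```java
import java.util.ArrayList;
import java.util.List;

final class CanopySketch {
    /** Hypothetical stand-in for a canopy: a fixed center plus running totals. */
    static final class Canopy {
        final double[] center;
        final double[] pointTotals;
        int numPoints;

        Canopy(double[] center) {
            this.center = center.clone();
            this.pointTotals = new double[center.length];
            addPoint(center); // assumption: the founding point counts as a member
        }

        void addPoint(double[] p) {
            for (int i = 0; i < p.length; i++) pointTotals[i] += p[i];
            numPoints++;
        }

        /** Normalizes the totals by the number of points. */
        double[] computeCentroid() {
            double[] c = new double[pointTotals.length];
            for (int i = 0; i < c.length; i++) c[i] = pointTotals[i] / numPoints;
            return c;
        }
    }

    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    /** Adds p to every canopy within t1; starts a new canopy unless p is within t2 of one. */
    static void addPointToCanopies(double[] p, List<Canopy> canopies, double t1, double t2) {
        boolean stronglyBound = false;
        for (Canopy c : canopies) {
            double d = manhattan(c.center, p);
            if (d < t1) c.addPoint(p);
            stronglyBound |= d < t2;
        }
        if (!stronglyBound) canopies.add(new Canopy(p));
    }

    /** Map + combine for one subset: build canopies, then output their centroids. */
    static List<double[]> centroids(List<double[]> points, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) addPointToCanopies(p, canopies, t1, t2);
        List<double[]> out = new ArrayList<>();
        for (Canopy c : canopies) out.add(c.computeCentroid());
        return out;
    }

    /** Reduce: cluster the per-subset centroids again into the final canopies. */
    static List<double[]> generate(List<List<double[]>> splits, double t1, double t2) {
        List<double[]> partial = new ArrayList<>();
        for (List<double[]> split : splits) partial.addAll(centroids(split, t1, t2));
        return centroids(partial, t1, t2);
    }
}
```

Because the reduce step only ever sees centroids, nothing has to be emitted
from a close() method in this arrangement.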

This patch also incorporates Grant Ingersoll's comments on the name of the 
Canopy subclass (now VisibleCanopy instead of TestCanopy), and the .diff file 
is now generated from inside the project root.

NEW: This patch implements the actual clustering of the original points using
the canopy centers produced by the cluster generation phase.
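
A rough sketch of that clustering step, under stated assumptions: the class
ClusterSketch and its assign method are hypothetical names, points are plain
double[] arrays, the distance is Euclidean, and a canopy is taken to cover a
point when the distance to its center is below t1. A point that no canopy
covers falls back to the closest canopy, which is the case where center
clustering has moved the centroids slightly away from the point.

```java
import java.util.ArrayList;
import java.util.List;

final class ClusterSketch {
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the indices of the canopy centers the point is assigned to. */
    static List<Integer> assign(double[] point, List<double[]> centers, double t1) {
        List<Integer> assigned = new ArrayList<>();
        int closest = -1;
        double minDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.size(); i++) {
            double d = euclidean(centers.get(i), point);
            if (d < t1) assigned.add(i);   // covered: the point joins this canopy
            if (d < minDist) {             // remember the nearest center as a fallback
                minDist = d;
                closest = i;
            }
        }
        // No canopy covers the point: assign it to the closest one instead.
        if (assigned.isEmpty() && closest >= 0) assigned.add(closest);
        return assigned;
    }
}
```

A point can land in more than one canopy, since canopies are allowed to
overlap.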

TODO: Sort out the generics

TODO: Allow the CanopyReducer to take different (e.g. smaller) threshold values
so that canopy coalescing will not be so aggressive. 

TODO: Allow points to carry payloads for use by other subsystems, to be 
sparse, ...

All unit tests run.

- src/main/java/org/apache/mahout/clustering/canopy
  - Canopy.java
    (configure): sets the distance measure, t1 and t2 statics for subsequent
      operations. Assumes all canopies created by this class loader will
      have the same properties.
    (addPointToCanopies): applies the distance metric to all canopies,
      adding the point to those that are covered
    (emitPointToNewCanopies): same algorithm but used by CanopyMapper to
      output points with canopyIds to CanopyCombiner
    (emitPointToExistingCanopies): checks the distance and emits the point
      with each canopy definition as key. Emits the point to the closest
      canopy if canopy center clustering has moved the centroids so that 
      the point is slightly outside of an existing canopy.
    (addPoint): add a point to the pointTotals and bump numPoints
    (emitPoint): output the point to the collector thence to the combiner
    (getCenter): returns the canopy center
    (getNumPoints): returns the number of points in the canopy
    (getCanopyId): returns the canopyId
    (computeCentroid): normalizes the pointTotals with the numPoints 
      to return a computed centroid for the canopy
    (formatPoint, decodePoint): encoding/decoding for points
    (formatCanopy, decodeCanopy): encoding/decoding for canopies
    (covers): returns whether the point is covered by the canopy
    (ptOut, toString): utilities
  - CanopyDriver.java
    (main): the main program
    (runJob): static used by unit tests
  - CanopyMapper.java
    (map): the map function assigns points to canopies outputting each
      point to each of its canopies
    (configure): reads distance measure and thresholds from job and
      configures Canopy.
  - CanopyCombiner.java
    (reduce): computes & writes the canopy centroids to the output using
      a single "centroid" key
    (configure): reads distance measure and thresholds from job and
      configures Canopy.
  - CanopyReducer.java
    (reduce): the reduce function assigns points to canopies
    (configure): reads distance measure and thresholds from job and
      configures Canopy.
  - ClusterMapper.java
    (map): the map function assigns points to existing canopies outputting 
      each point to each of its canopies
    (configure): reads distance measure and thresholds from job and
      configures Canopy. Also reads the canopy definitions produced by
      the CanopyReducer.
  - ClusterDriver.java
    (main): the main program uses IdentityReducers
    (runJob): static used by unit tests
  - Job.java
    (main): the main program invokes CanopyDriver and ClusterDriver
    (runJob): static used by unit tests
  - DistanceMeasure.java
    (distance): compute the distance between two points by some measure
  - EuclideanDistanceMeasure.java
    (distance): compute the distance between two points by Euclidean measure
  - ManhattanDistanceMeasure.java
    (distance): compute the distance between two points by Manhattan measure
- src/test/java/org/apache/mahout/clustering/canopy
  - DummyOutputCollector.java
    (collect): collects output data in a map
    (getData): returns output data for unit tests
    (getKeys): returns the key set
    (getValue): returns the value associated with the key
  - VisibleCanopy.java
    (addPoint): overrides Canopy method to add point to a list
    (toString): overrides Canopy method to add point printout
  - TestCanopyCreation.java
    (setUp): uses published algorithm to initialize reference data
    (testReferenceManhattan, testReferenceEuclidean): validates reference data
    (testIterativeManhattan, testIterativeEuclidean): uses optimized
      algorithm and verifies result vs. reference data
    (testCanopyMapperManhattan, testCanopyMapperEuclidean,
     testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
      mapper/combiner and reducer with test data
    (testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
      resulting canopy centroids
    (testClusterMapperManhattan, testClusterMapperEuclidean,
     testClusterReducerManhattan, testClusterReducerEuclidean): exercises
      mapper and reducer with test data, testing clustering correctness
    (testClusteringManhattanMR, testClusteringEuclideanMR): runs both
      canopy generation and clustering to print out results
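
To illustrate how a pluggable distance measure changes what a canopy covers,
here is a small sketch. It does not reproduce the patch's DistanceMeasure
classes: the DistanceSketch class, its Measure interface, the double[] point
type and the "distance below t1" coverage rule are assumptions made for the
example.

```java
final class DistanceSketch {
    /** Hypothetical single-method interface for a distance measure. */
    interface Measure {
        double distance(double[] p1, double[] p2);
    }

    /** Sum of absolute coordinate differences. */
    static final Measure MANHATTAN = (p1, p2) -> {
        double sum = 0;
        for (int i = 0; i < p1.length; i++) sum += Math.abs(p1[i] - p2[i]);
        return sum;
    };

    /** Square root of the sum of squared coordinate differences. */
    static final Measure EUCLIDEAN = (p1, p2) -> {
        double sum = 0;
        for (int i = 0; i < p1.length; i++) {
            double d = p1[i] - p2[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    /** Assumed coverage rule: the point lies within t1 of the canopy center. */
    static boolean covers(Measure m, double[] center, double[] point, double t1) {
        return m.distance(center, point) < t1;
    }
}
```

With the same threshold the two measures can disagree: for a center at (0, 0)
and a point at (3, 4), the Euclidean distance is 5 and the Manhattan distance
is 7, so a t1 of 6 covers the point under one measure and not the other. This
is one reason the tests are run separately for each measure.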

> Build initial canopy clustering prototype
> -----------------------------------------
>
>                 Key: MAHOUT-3
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-3
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Jeff Eastman
>         Attachments: MAHOUT-3.diff, MAHOUT-3a.diff, MAHOUT-3b.diff, 
> MAHOUT-3c.diff
>
>
> I'd like to reserve some namespace, specifically 
> org.apache.mahout.clustering.canopy to use for an initial prototype of canopy 
> clustering. I'm going to start with a little unit test to get the basic 
> algorithm sorted out, then M/R it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
