[jira] Issue Comment Edited: (MAHOUT-3) Build initial canopy clustering prototype

Jeff Eastman (JIRA) Mon, 11 Feb 2008 17:29:17 -0800

    [ https://issues.apache.org/jira/browse/MAHOUT-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567897#action_12567897 ]


jeastman edited comment on MAHOUT-3 at 2/11/08 4:49 PM:
------------------------------------------------------------

Improved implementation of the canopy generation phase of the two-phase
Canopy Clustering algorithm. See the unit tests for the evolution of the
user stories leading to the working implementation.

This implementation incorporates Ted Dunning's comments on my original approach.
In particular, it does not rely upon emitting data during the close() operation.
During the map phase, subsets of the input points are assigned to canopies
by each mapper and output to a combiner which then computes and outputs the
canopy centroids for each subset. During the reduce phase, the centroids are
again clustered into a final set of canopies which are output. 
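
As a rough illustration of the flow described above, here is a minimal
in-memory sketch. It is not the code in the attached patch: it uses no Hadoop
types, and the class name TwoPassCanopySketch, the double[] point
representation, the Manhattan distance, and the T1/T2 values are all
assumptions made for the example. It also averages the per-split centroids
without weighting them by point counts, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory sketch of the map/combine/reduce flow; no Hadoop types. */
final class TwoPassCanopySketch {

    // Assumed thresholds with T1 > T2; the values are illustrative only.
    static final double T1 = 4.0;
    static final double T2 = 2.0;

    /** Manhattan distance, chosen here only to keep the arithmetic simple. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** One canopy: a fixed center plus running totals of its member points. */
    static final class Canopy {
        final double[] center;
        final double[] pointTotals;
        int numPoints;

        Canopy(double[] center) {
            this.center = center.clone();
            this.pointTotals = new double[center.length];
            addPoint(center);
        }

        void addPoint(double[] point) {
            for (int i = 0; i < point.length; i++) {
                pointTotals[i] += point[i];
            }
            numPoints++;
        }

        double[] computeCentroid() {
            double[] centroid = new double[pointTotals.length];
            for (int i = 0; i < centroid.length; i++) {
                centroid[i] = pointTotals[i] / numPoints;
            }
            return centroid;
        }
    }

    /** Assigns points to canopies and returns one centroid per canopy. */
    static List<double[]> canopyCentroids(List<double[]> points) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] point : points) {
            boolean stronglyBound = false;
            for (Canopy canopy : canopies) {
                double d = distance(canopy.center, point);
                if (d < T1) {
                    canopy.addPoint(point);   // covered by this canopy
                }
                if (d < T2) {
                    stronglyBound = true;     // close enough: no new canopy needed
                }
            }
            if (!stronglyBound) {
                canopies.add(new Canopy(point));
            }
        }
        List<double[]> centroids = new ArrayList<>();
        for (Canopy canopy : canopies) {
            centroids.add(canopy.computeCentroid());
        }
        return centroids;
    }

    /** Map and combine each split separately, then reduce over all centroids. */
    static List<double[]> run(List<List<double[]>> splits) {
        List<double[]> partialCentroids = new ArrayList<>();
        for (List<double[]> split : splits) {
            partialCentroids.addAll(canopyCentroids(split));
        }
        return canopyCentroids(partialCentroids);
    }

    public static void main(String[] args) {
        List<List<double[]>> splits = Arrays.asList(
            Arrays.asList(new double[] {1, 1}, new double[] {2, 1}, new double[] {8, 8}),
            Arrays.asList(new double[] {1, 2}, new double[] {9, 8}));
        for (double[] centroid : run(splits)) {
            System.out.println(Arrays.toString(centroid)); // [1.25, 1.5] then [8.5, 8.0]
        }
    }
}
```

Nothing in this sketch is emitted at the end of a pass from saved state; each
stage returns its centroids directly, which is the property the paragraph
above is after.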

This also incorporates Grant Ingersoll's comments on the name of the Canopy
subclass (now VisibleCanopy instead of TestCanopy), and the .diff file is
now generated from inside the project root.

TODO: Implement the actual clustering of the original points using
the canopy centers produced by this implementation.

TODO: Sort out the generics

TODO: Allow points to be sparse, to carry payloads for use by other 
subsystems, ...

All unit tests run.

- src/main/java/org/apache/mahout/clustering/canopy
  - Canopy.java
    (configure): sets the distance measure, t1 and t2 statics for subsequent
      operations. Assumes all canopies created by this class loader will
      have the same properties.
    (addPointToCanopies): applies the distance metric to all canopies,
      adding the point to those that are covered
    (emitPointToCanopies): same algorithm but used by mapper to output
      points with canopyIds to CanopyCombiner
    (addPoint): adds a point to the pointTotals and increments numPoints
    (emitPoint): outputs the point to the collector and thence to the combiner
    (getCenter): returns the canopy center
    (getNumPoints): returns the number of points in the canopy
    (getCanopyId): returns the canopyId
    (computeCentroid): normalizes the pointTotals with the numPoints
      to return a computed centroid for the canopy
    (formatPoint, decodePoint): encoding/decoding for points
    (formatCanopy, decodeCanopy): encoding/decoding for canopies
    (ptOut, toString): utilities
  - CanopyDriver.java
    (main): the main program
    (runJob): static used by unit tests
  - CanopyMapper.java
    (map): the map function assigns points to canopies outputting each
      point to each of its canopies
    (configure): reads distance measure and thresholds from job and
      configures Canopy.
  - CanopyCombiner.java
    (reduce): computes & writes the canopy centroids to the output using
      a single "centroid" key
    (configure): reads distance measure and thresholds from job and
      configures Canopy.
  - CanopyReducer.java
    (reduce): the reduce function assigns the centroids received from the
      combiners to the final set of canopies
    (configure): reads distance measure and thresholds from job and
      configures Canopy.
  - DistanceMeasure.java
    (distance): computes the distance between two points by some measure
  - EuclideanDistanceMeasure.java
    (distance): computes the distance between two points by Euclidean measure
  - ManhattanDistanceMeasure.java
    (distance): computes the distance between two points by Manhattan measure
- src/test/java/org/apache/mahout/clustering/canopy
  - DummyOutputCollector.java
    (collect): collects output data in a map
    (getData): returns output data for unit tests
    (getKeys): returns the key set
    (getValue): returns the value associated with the key
  - VisibleCanopy.java
    (addPoint): overrides Canopy method to add point to a list
    (toString): overrides Canopy method to add point printout
  - TestCanopyCreation.java
    (setUp): uses published algorithm to initialize reference data
    (testReferenceManhattan, testReferenceEuclidean): validates reference data
    (testIterativeManhattan, testIterativeEuclidean): uses optimized
      algorithm and verifies result vs. reference data
    (testCanopyMapperManhattan, testCanopyMapperEuclidean,
     testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
      mapper/combiner and reducer with test data
    (testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
      resulting canopy centroids
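
To make the distance measures and the addPoint/computeCentroid bookkeeping
listed above more concrete, here is a small sketch. The signatures are
assumptions: points are plain double[] arrays, and the names
CanopyPartsSketch, Measure, and Totals are hypothetical and do not come from
the patch.

```java
import java.util.Arrays;

/** Hypothetical sketch; signatures over double[] are assumed, not taken from the patch. */
final class CanopyPartsSketch {

    /** Assumed shape of a pluggable distance measure. */
    interface Measure {
        double distance(double[] a, double[] b);
    }

    /** Square root of the sum of squared coordinate differences. */
    static final Measure EUCLIDEAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    /** Sum of absolute coordinate differences. */
    static final Measure MANHATTAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    };

    /** Running totals in the spirit of addPoint and computeCentroid. */
    static final class Totals {
        private final double[] pointTotals;
        private int numPoints;

        Totals(int dimensions) {
            this.pointTotals = new double[dimensions];
        }

        void addPoint(double[] point) {
            for (int i = 0; i < pointTotals.length; i++) {
                pointTotals[i] += point[i];
            }
            numPoints++;
        }

        /** Divides each coordinate total by the number of points added. */
        double[] computeCentroid() {
            if (numPoints == 0) {
                throw new IllegalStateException("no points added");
            }
            double[] centroid = new double[pointTotals.length];
            for (int i = 0; i < centroid.length; i++) {
                centroid[i] = pointTotals[i] / numPoints;
            }
            return centroid;
        }
    }

    public static void main(String[] args) {
        double[] p = {0, 0};
        double[] q = {3, 4};
        System.out.println(EUCLIDEAN.distance(p, q)); // 5.0
        System.out.println(MANHATTAN.distance(p, q)); // 7.0

        Totals totals = new Totals(2);
        totals.addPoint(p);
        totals.addPoint(q);
        System.out.println(Arrays.toString(totals.computeCentroid())); // [1.5, 2.0]
    }
}
```

Keeping only the totals and a count means a centroid can be computed without
retaining the member points, which is presumably why the test subclass has to
override addPoint to keep a list.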

> Build initial canopy clustering prototype
> -----------------------------------------
>
>                 Key: MAHOUT-3
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-3
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Jeff Eastman
>         Attachments: MAHOUT-3.diff, MAHOUT-3a.diff, MAHOUT-3b.diff
>
>
> I'd like to reserve some namespace, specifically 
> org.apache.mahout.clustering.canopy to use for an initial prototype of canopy 
> clustering. I'm going to start with a little unit test to get the basic 
> algorithm sorted out, then M/R it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
