Re: [jira] Issue Comment Edited: (MAHOUT-3) Build initial canopy clustering prototype

Ted Dunning Mon, 11 Feb 2008 17:32:25 -0800


Well done!



On 2/11/08 4:49 PM, "Jeff Eastman (JIRA)" <[EMAIL PROTECTED]> wrote:

> 
>     [ https://issues.apache.org/jira/browse/MAHOUT-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567897#action_12567897 ]
> 
> jeastman edited comment on MAHOUT-3 at 2/11/08 4:49 PM:
> ------------------------------------------------------------
> 
> Improved implementation of Canopy generation phase of two-phase Canopy
> Clustering algorithm. See unit tests for the evolution of the user
> stories leading to the working implementation.
> 
> This implementation incorporates Ted Dunning's comments on my original
> approach.
> In particular, it does not rely upon emitting data during the close()
> operation.
> During the map phase, subsets of the input points are assigned to canopies
> by each mapper and output to a combiner which then computes and outputs the
> canopy centroids for each subset. During the reduce phase, the centroids are
> again clustered into a final set of canopies which are output.
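
The flow described above can be sketched in plain Java without Hadoop. This is only an illustration, not the code in the attached patch: the class and method names, the double[] point representation, the threshold values, and the exact T1/T2 assignment rule are all assumptions. Each input split is reduced to per-canopy centroids (the map plus combine step), and those centroids are then clustered once more (the reduce step).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from the MAHOUT-3 patch.
class CanopyFlowSketch {
    static final double T1 = 3.0; // loose threshold (assumed value)
    static final double T2 = 1.5; // tight threshold (assumed value), T2 < T1

    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    /** A canopy keeps its center plus running totals for the centroid. */
    static final class Canopy {
        final double[] center;
        final double[] pointTotals;
        int numPoints;

        Canopy(double[] firstPoint) {
            center = firstPoint.clone();
            pointTotals = new double[firstPoint.length];
            addPoint(firstPoint);
        }

        void addPoint(double[] p) {
            for (int i = 0; i < p.length; i++) pointTotals[i] += p[i];
            numPoints++;
        }

        double[] computeCentroid() {
            double[] c = new double[pointTotals.length];
            for (int i = 0; i < c.length; i++) c[i] = pointTotals[i] / numPoints;
            return c;
        }
    }

    /** Assigns points to canopies and returns one centroid per canopy. */
    static List<double[]> canopyCentroids(List<double[]> points) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) {
            boolean stronglyBound = false;
            for (Canopy c : canopies) {
                double d = manhattan(c.center, p);
                if (d < T1) c.addPoint(p);   // covered by this canopy
                stronglyBound |= d < T2;     // close enough to not seed a new one
            }
            if (!stronglyBound) canopies.add(new Canopy(p));
        }
        List<double[]> centroids = new ArrayList<>();
        for (Canopy c : canopies) centroids.add(c.computeCentroid());
        return centroids;
    }

    /** Per-split centroids first, then one more pass over those centroids. */
    static List<double[]> twoStage(List<List<double[]>> splits) {
        List<double[]> partial = new ArrayList<>();
        for (List<double[]> split : splits) partial.addAll(canopyCentroids(split));
        return canopyCentroids(partial);
    }

    public static void main(String[] args) {
        List<List<double[]>> splits = Arrays.asList(
            Arrays.asList(new double[] {1, 1}, new double[] {2, 1}),
            Arrays.asList(new double[] {5, 5}, new double[] {4, 5}));
        for (double[] c : twoStage(splits)) System.out.println(Arrays.toString(c));
    }
}
```

Because every stage emits its results from ordinary per-record processing, nothing in this shape depends on writing output at shutdown, which matches the point about close() above.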
> 
> This also incorporates Grant Ingersoll's comments on the name of the Canopy
> subclass (now VisibleCanopy vs. TestCanopy) and the .diff file is done from
> inside the project root.
> 
> TODO: Implement the actual clustering of the original points using
> the canopy centers produced by this implementation.
> 
> TODO: Sort out the generics
> 
> TODO: Allow points to be sparse, to carry payloads for use by other
> subsystems, ...
> 
> All unit tests run.
> 
> - src/main/java/org/apache/mahout/clustering/canopy
>   - Canopy.java
>     (configure): sets the distance measure, t1 and t2 statics for subsequent
>       operations. Assumes all canopies created by this class loader will
>       have the same properties.
>     (addPointToCanopies): applies the distance metric to all canopies,
>       adding the point to those that are covered
>     (emitPointToCanopies): same algorithm but used by mapper to output
>       points with canopyIds to CanopyCombiner
>     (addPoint): add a point to the pointTotals and bump numPoints
>     (emitPoint): output the point to the collector thence to the combiner
>     (getCenter): returns the canopy center
>     (getNumPoints): returns the number of points in the canopy
>     (getCanopyId): returns the canopyId
>     (computeCentroid): normalizes the pointTotals with the numPoints
>       to return a computed centroid for the canopy
>     (formatPoint, decodePoint): encoding/decoding for points
>     (formatCanopy, decodeCanopy): encoding/decoding for canopies
>     (ptOut, toString): utilities
>   - CanopyDriver.java
>     (main): the main program
>     (runJob): static used by unit tests
>   - CanopyMapper.java
>     (map): the map function assigns points to canopies outputting each
>       point to each of its canopies
>     (configure): reads distance measure and thresholds from job and
>       configures Canopy.
>   - CanopyCombiner.java
>     (reduce): computes & writes the canopy centroids to the output using
>       a single "centroid" key
>     (configure): reads distance measure and thresholds from job and
>       configures Canopy.
>   - CanopyReducer.java
>     (reduce): the reduce function assigns points to canopies
>     (configure): reads distance measure and thresholds from job and
>       configures Canopy.
>   - DistanceMeasure.java
>     (distance): compute the distance between two points by some measure
>   - EuclideanDistanceMeasure.java
>     (distance): compute the distance between two points by Euclidean measure
>   - ManhattanDistanceMeasure.java
>     (distance): compute the distance between two points by Manhattan measure
> - src/test/java/org/apache/mahout/clustering/canopy
>   - DummyOutputCollector.java
>     (collect): collects output data in a map
>     (getData): returns output data for unit tests
>     (getKeys): returns the key set
>     (getValue): returns the value associated with the key
>   - VisibleCanopy.java
>     (addPoint): overrides Canopy method to add point to a list
>     (toString): overrides Canopy method to add point printout
>   - TestCanopyCreation.java
>     (setUp): uses published algorithm to initialize reference data
>     (testReferenceManhattan, testReferenceEuclidean): validates reference data
>     (testIterativeManhattan, testIterativeEuclidean): uses optimized
>       algorithm and verifies result vs. reference data
>     (testCanopyMapperManhattan, testCanopyMapperEuclidean,
>      testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
>       mapper/combiner and reducer with test data
>     (testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
>       resulting canopy centroids
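
For readers unfamiliar with the two measures in the listing above, here is a minimal sketch of a pluggable distance interface with Euclidean and Manhattan implementations. The double[] point type, the nested class names, and the method signature are assumptions made for illustration; the patch's own DistanceMeasure may differ.

```java
// Hypothetical sketch; the patch's real interface and point type may differ.
class DistanceSketch {
    /** Anything that can score how far apart two equal-length points are. */
    interface Measure {
        double distance(double[] a, double[] b);
    }

    /** Straight-line distance: square root of the summed squared differences. */
    static final class Euclidean implements Measure {
        public double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }
    }

    /** City-block distance: sum of the absolute coordinate differences. */
    static final class Manhattan implements Measure {
        public double distance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
            return sum;
        }
    }

    public static void main(String[] args) {
        double[] origin = {0, 0};
        double[] p = {3, 4};
        System.out.println(new Euclidean().distance(origin, p));
        System.out.println(new Manhattan().distance(origin, p));
    }
}
```

Since the Manhattan distance is never smaller than the Euclidean distance for the same pair of points, the same t1 and t2 values generally cover fewer points under Manhattan, so thresholds would need to be chosen per measure.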
> 
>       was (Author: jeastman):
>     Improved implementation of Canopy generation phase of two-phase Canopy
> Clustering algorithm. See unit tests for the evolution of the user
> stories leading to the working implementation.
> 
> This implementation incorporates Ted Dunning's comments on my original
> approach.
> In particular, it does not rely upon emitting data during the close()
> operation.
> During the map phase, subsets of the input points are assigned to canopies
> by each mapper and output to a combiner which then computes and outputs the
> canopy centroids for each subset. During the reduce phase, the centroids are
> again clustered into a final set of canopies which are output.
> 
> This also incorporates Grant Ingersoll's comments on the name of the Canopy
> subclass (now VisibleCanopy vs. TestCanopy) and the .diff file is done from
> inside the project root.
> 
> TODO: Implement the actual clustering of the original points using
> the canopy centers produced by this implementation.
> 
> TODO: Sort out the generics
> 
> TODO: Allow points to be sparse, to carry payloads for use by other
> subsystems, ...
> 
> All unit tests run.
> 
> - src/main/java/org/apache/mahout/clustering/canopy
>   - Canopy.java
>     (configure): sets the distance measure, t1 and t2 statics for subsequent
>       operations. Assumes all canopies created by this class loader will
>       have the same properties.
>     (addPointToCanopies): applies the distance metric to all canopies,
>       adding the point to those that are covered
>     (emitPointToCanopies): same algorithm but used by mapper to output
>       points with canopyIds to CanopyCombiner
>     (addPoint): add a point to the pointTotals and bump numPoints
>     (emitPoint): output the point to the collector thence to the combiner
>     (getCenter): returns the canopy center
>     (getNumPoints): returns the number of points in the canopy
>     (getCanopyId): returns the canopyId
>     (computeCentroid): normalizes the pointTotals with the numPoints
>       to return a computed centroid for the canopy
>     (formatPoint, decodePoint): encoding/decoding for points
>     (formatCanopy, decodeCanopy): encoding/decoding for canopies
>     (ptOut, toString): utilities
>   - CanopyDriver.java
>     (main): the main program
>     (runJob): static used by unit tests
>   - CanopyMapper.java
>     (map): the map function assigns points to canopies outputting each
>       point to each of its canopies
>     (configure): reads distance measure and thresholds from job and
>       configures Canopy.
>   - CanopyCombiner.java
>     (reduce): computes & writes the canopy centroids to the output using
>       a single "centroid" key
>     (configure): reads distance measure and thresholds from job and
>       configures Canopy.
>   - CanopyReducer.java
>     (reduce): the reduce function assigns points to canopies
>     (configure): reads distance measure and thresholds from job and
>       configures Canopy.
>   - DistanceMeasure.java
>     (distance): compute the distance between two points by some measure
>   - EuclideanDistanceMeasure.java
>     (distance): compute the distance between two points by Euclidean measure
>   - ManhattanDistanceMeasure.java
>     (distance): compute the distance between two points by Manhattan measure
> - src/test/java/org/apache/mahout/clustering/canopy
>   - DummyOutputCollector.java
>     (collect): collects output data in a map
>     (getData): returns output data for unit tests
>     (getKeys): returns the key set
>     (getValue): returns the value associated with the key
>   - TestCanopy.java
>     (addPoint): overrides Canopy method to add point to a list
>     (toString): overrides Canopy method to add point printout
>   - TestCanopyCreation.java
>     (setUp): uses published algorithm to initialize reference data
>     (testReferenceManhattan, testReferenceEuclidean): validates reference data
>     (testIterativeManhattan, testIterativeEuclidean): uses optimized
>       algorithm and verifies result vs. reference data
>     (testCanopyMapperManhattan, testCanopyMapperEuclidean,
>      testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
>       mapper/combiner and reducer with test data
>     (testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
>       resulting canopy centroids
>   
>> Build initial canopy clustering prototype
>> -----------------------------------------
>> 
>>                 Key: MAHOUT-3
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-3
>>             Project: Mahout
>>          Issue Type: New Feature
>>            Reporter: Jeff Eastman
>>         Attachments: MAHOUT-3.diff, MAHOUT-3a.diff, MAHOUT-3b.diff
>> 
>> 
>> I'd like to reserve some namespace, specifically
>> org.apache.mahout.clustering.canopy to use for an initial prototype of canopy
>> clustering. I'm going to start with a little unit test to get the basic
>> algorithm sorted out, then M/R it.
