Well done!
On 2/11/08 4:49 PM, "Jeff Eastman (JIRA)" <[EMAIL PROTECTED]> wrote:

>
> [
> https://issues.apache.org/jira/browse/MAHOUT-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567897#action_12567897
> ]
>
> jeastman edited comment on MAHOUT-3 at 2/11/08 4:49 PM:
> ------------------------------------------------------------
>
> Improved implementation of Canopy generation phase of two-phase Canopy
> Clustering algorithm. See unit tests for the evolution of the user
> stories leading to the working implementation.
>
> This implementation incorporates Ted Dunning's comments on my original
> approach.
> In particular, it does not rely upon emitting data during the close()
> operation.
> During the map phase, subsets of the input points are assigned to canopies
> by each mapper and output to a combiner which then computes and outputs the
> canopy centroids for each subset. During the reduce phase, the centroids are
> again clustered into a final set of canopies which are output.
>
> This also incorporates Grant Ingersoll's comments on the name of the Canopy
> subclass (now VisibleCanopy vs. TestCanopy) and the .diff file is done from
> inside the project root.
>
> TODO: Implement the actual clustering of the original points using
> the canopy centers produced by this implementation.
>
> TODO: Sort out the generics
>
> TODO: Allow points to be sparse, to carry payloads for use by other
> subsystems, ...
>
> All unit tests run.
>
> - src/main/java/org/apache/mahout/clustering/canopy
>   - Canopy.java
>     (configure): sets the distance measure, t1 and t2 statics for subsequent
>     operations. Assumes all canopies created by this class loader will
>     have the same properties.
>     (addPointToCanopies): applies the distance metric to all canopies,
>     adding the point to those that are covered
>     (emitPointToCanopies): same algorithm but used by mapper to output
>     points with canopyIds to CanopyCombiner
>     (addPoint): add a point to the pointTotals and bump numPoints
>     (emitPoint): output the point to the collector thence to the combiner
>     (getCenter): returns the canopy center
>     (getNumPoints): returns the number of points in the canopy
>     (getCanopyId): returns the canopyId
>     (computeCentroid): normalizes the pointTotals with the numPoints
>     to return a computed centroid for the canopy
>     (formatPoint, decodePoint): encoding/decoding for points
>     (formatCanopy, decodeCanopy): encoding/decoding for canopies
>     (ptOut, toString): utilities
>   - CanopyDriver.java
>     (main): the main program
>     (runJob): static used by unit tests
>   - CanopyMapper.java
>     (map): the map function assigns points to canopies outputting each
>     point to each of its canopies
>     (configure): reads distance measure and thresholds from job and
>     configures Canopy.
>   - CanopyCombiner.java
>     (reduce): computes & writes the canopy centroids to the output using
>     a single "centroid" key
>     (configure): reads distance measure and thresholds from job and
>     configures Canopy.
>   - CanopyReducer.java
>     (reduce): the reduce function assigns points to canopies
>     (configure): reads distance measure and thresholds from job and
>     configures Canopy.
>   - DistanceMeasure.java
>     (distance): compute the distance between two points by some measure
>   - EuclideanDistanceMeasure.java
>     (distance): compute the distance between two points by Euclidean measure
>   - ManhattanDistanceMeasure.java
>     (distance): compute the distance between two points by Manhattan measure
> - src/test/java/org/apache/mahout/clustering/canopy
>   - DummyOutputCollector.java
>     (collect): collects output data in a map
>     (getData): returns output data for unit tests
>     (getKeys): returns the key set
>     (getValue): returns the value associated with the key
>   - VisibleCanopy.java
>     (addPoint): overrides Canopy method to add point to a list
>     (toString): overrides Canopy method to add point printout
>   - TestCanopyCreation.java
>     (setUp): uses published algorithm to initialize reference data
>     (testReferenceManhattan, testReferenceEuclidean): validates reference data
>     (testIterativeManhattan, testIterativeEuclidean): uses optimized
>     algorithm and verifies result vs. reference data
>     (testCanopyMapperManhattan, testCanopyMapperEuclidean,
>     testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
>     mapper/combiner and reducer with test data
>     (testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
>     resulting canopy centroids
>
>> Build initial canopy clustering prototype
>> -----------------------------------------
>>
>> Key: MAHOUT-3
>> URL: https://issues.apache.org/jira/browse/MAHOUT-3
>> Project: Mahout
>> Issue Type: New Feature
>> Reporter: Jeff Eastman
>> Attachments: MAHOUT-3.diff, MAHOUT-3a.diff, MAHOUT-3b.diff
>>
>>
>> I'd like to reserve some namespace, specifically
>> org.apache.mahout.clustering.canopy to use for an initial prototype of canopy
>> clustering. I'm going to start with a little unit test to get the basic
>> algorithm sorted out, then M/R it.
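
For readers following the description of the canopy generation phase quoted above, here is a rough single-JVM sketch of how the pieces might fit together. It is not the code in the attached .diff files. The class CanopySketch, the Distance interface, the twoPhase helper and the exact threshold rule (join a canopy when the distance is below t1, skip creating a new canopy when it is below t2) are assumptions made for illustration; only the names addPoint, computeCentroid, addPointToCanopies, pointTotals, numPoints, t1 and t2 come from the comment itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CanopySketch {
    /** Hypothetical stand-in for the DistanceMeasure interface named in the patch. */
    interface Distance { double distance(double[] a, double[] b); }

    static final Distance MANHATTAN = (a, b) -> {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += Math.abs(a[i] - b[i]);
        return sum;
    };

    static final Distance EUCLIDEAN = (a, b) -> {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    };

    static class Canopy {
        final double[] center;
        final double[] pointTotals;
        int numPoints;

        Canopy(double[] center) {
            this.center = center.clone();
            this.pointTotals = new double[center.length];
            addPoint(center); // the founding point counts toward the centroid
        }

        /** Add a point to the running totals and bump the point count. */
        void addPoint(double[] p) {
            for (int i = 0; i < p.length; i++) pointTotals[i] += p[i];
            numPoints++;
        }

        /** Normalize the totals by the point count to get the centroid. */
        double[] computeCentroid() {
            double[] c = new double[pointTotals.length];
            for (int i = 0; i < c.length; i++) c[i] = pointTotals[i] / numPoints;
            return c;
        }
    }

    /** Assumed rule: join every canopy within t1; found a new canopy unless some canopy is within t2. */
    static void addPointToCanopies(double[] p, List<Canopy> canopies,
                                   Distance d, double t1, double t2) {
        boolean stronglyBound = false;
        for (Canopy c : canopies) {
            double dist = d.distance(c.center, p);
            if (dist < t1) c.addPoint(p);
            if (dist < t2) stronglyBound = true;
        }
        if (!stronglyBound) canopies.add(new Canopy(p));
    }

    /** Cluster the points into canopies and return one centroid per canopy. */
    static List<double[]> centroids(List<double[]> points, Distance d, double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] p : points) addPointToCanopies(p, canopies, d, t1, t2);
        List<double[]> out = new ArrayList<>();
        for (Canopy c : canopies) out.add(c.computeCentroid());
        return out;
    }

    /** Per-subset centroids (map + combine), then re-cluster those centroids (reduce). */
    static List<double[]> twoPhase(List<List<double[]>> subsets, Distance d, double t1, double t2) {
        List<double[]> partial = new ArrayList<>();
        for (List<double[]> subset : subsets) partial.addAll(centroids(subset, d, t1, t2));
        return centroids(partial, d, t1, t2);
    }

    public static void main(String[] args) {
        // Two made-up input splits, each seeing one point from each of two groups.
        List<double[]> splitA = Arrays.asList(new double[]{0, 0}, new double[]{10, 10});
        List<double[]> splitB = Arrays.asList(new double[]{1, 0}, new double[]{11, 10});
        for (double[] c : twoPhase(Arrays.asList(splitA, splitB), MANHATTAN, 3.0, 2.0)) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

Note that in this sketch the final step averages the per-subset centroids without weighting them by how many points each one represents. Whether the real reducer does that is not stated in the comment, so treat it as a simplification.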
