[
https://issues.apache.org/jira/browse/MAHOUT-3?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Eastman updated MAHOUT-3:
------------------------------
Attachment: MAHOUT-3a.diff
Initial implementation of Canopy generation phase of two-phase Canopy
Clustering algorithm. See unit tests for the evolution of the user
stories leading to the working implementation.
TODO: Implement the actual clustering of the original points using
the canopy centers produced by this implementation.
TODO: Sort out the generics
TODO: Allow points to be sparse, to carry payloads for use by other
subsystems, ...
All unit tests run.
- src/main/java/org/apache/mahout/clustering/canopy
- Canopy.java
(addPointToCanopies): applies the distance metric to all canopies,
adding the point to those that are covered
(getCentroid): returns the initial centroid
(getNumPoints): returns the number of points added
(computeCentroid): normalizes the pointTotals with the numPoints
to return a computed centroid for the canopy
(ptOut, toString, formatPoint): utilities
- CanopyDriver.java
(main): the main program
(runJob): static used by unit tests
- CanopyMapper.java
(map): the map function assigns points to canopies
(config): configuration provided for unit tests
(configure): reads distance measure and threshold from job
(close): writes the canopy centroids to the output
- CanopyReducer.java
(reduce): the reduce function assigns points to canopies
(config): configuration provided for unit tests
(configure): reads distance measure and threshold from job
(close): writes the canopy centroids to the output
- DistanceMeasure.java
(distance): computes the distance between two points by some measure
- EuclideanDistanceMeasure.java
(distance): computes the distance between two points by Euclidean measure
- ManhattanDistanceMeasure.java
(distance): computes the distance between two points by Manhattan measure
- src/test/java/org/apache/mahout/clustering/canopy
- DummyOutputCollector.java
(collect): collects output data
(getData): returns output data for unit tests
- TestCanopy.java
(addPoint): overrides Canopy method to add point to a list
(toString): overrides Canopy method to add point printout
- TestCanopyCreation.java
(setUp): uses published algorithm to initialize reference data
(testReferenceManhattan, testReferenceEuclidean): validates reference data
(testIterativeManhattan, testIterativeEuclidean): uses optimized
algorithm and verifies result vs. reference data
(testCanopyMapperManhattan, testCanopyMapperEuclidean,
testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
mapper and reducer with test data
(testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
resulting canopy centroids
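
For readers without the patch at hand, here is a minimal sketch of the
canopy generation phase that the file list above describes. It is not
the code in MAHOUT-3a.diff. The class name CanopySketch, the use of
double[] for points, the generate helper, and the two thresholds t1 and
t2 are assumptions made for illustration (the notes above mention only a
single "threshold" read from the job). The sketch follows the commonly
described single-pass variant: a point joins every canopy whose center
is within t1, and starts a new canopy unless some center is within t2.

```java
import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

    /** Hypothetical stand-in for the DistanceMeasure interface named in the patch. */
    interface DistanceMeasure {
        double distance(double[] p1, double[] p2);
    }

    /** Sum of absolute coordinate differences. */
    static final DistanceMeasure MANHATTAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    };

    /** Square root of the sum of squared coordinate differences. */
    static final DistanceMeasure EUCLIDEAN = (a, b) -> {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    /** A canopy keeps its founding point plus running totals of all covered points. */
    static final class Canopy {
        final double[] center;      // the initial centroid (founding point)
        final double[] pointTotals; // per-coordinate sums of added points
        int numPoints;

        Canopy(double[] founder) {
            center = founder.clone();
            pointTotals = new double[founder.length];
            addPoint(founder);
        }

        void addPoint(double[] p) {
            for (int i = 0; i < p.length; i++) {
                pointTotals[i] += p[i];
            }
            numPoints++;
        }

        /** Normalizes pointTotals by numPoints. */
        double[] computeCentroid() {
            double[] result = new double[pointTotals.length];
            for (int i = 0; i < result.length; i++) {
                result[i] = pointTotals[i] / numPoints;
            }
            return result;
        }
    }

    /** Adds the point to every canopy it falls under; may create a new canopy. */
    static void addPointToCanopies(double[] point, List<Canopy> canopies,
                                   DistanceMeasure measure, double t1, double t2) {
        boolean stronglyBound = false;
        for (Canopy canopy : canopies) {
            double d = measure.distance(canopy.center, point);
            if (d < t1) {
                canopy.addPoint(point);
            }
            stronglyBound |= d < t2;
        }
        if (!stronglyBound) {
            canopies.add(new Canopy(point));
        }
    }

    /** Runs the single pass over all points (assumes t1 > t2). */
    static List<Canopy> generate(double[][] points, DistanceMeasure measure,
                                 double t1, double t2) {
        List<Canopy> canopies = new ArrayList<>();
        for (double[] point : points) {
            addPointToCanopies(point, canopies, measure, t1, t2);
        }
        return canopies;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 1}, {1, 2}, {5, 5}, {6, 5}};
        for (Canopy c : generate(points, MANHATTAN, 3.1, 2.1)) {
            double[] centroid = c.computeCentroid();
            System.out.println(c.numPoints + " points, centroid ("
                    + centroid[0] + ", " + centroid[1] + ")");
        }
    }
}
```

In a map/reduce setting the same pass could plausibly run once per
mapper over its share of the points and once more in the reducer over
the emitted centroids, but that split is an assumption here and should
be checked against the patch itself.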
> Build initial canopy clustering prototype
> -----------------------------------------
>
> Key: MAHOUT-3
> URL: https://issues.apache.org/jira/browse/MAHOUT-3
> Project: Mahout
> Issue Type: New Feature
> Reporter: Jeff Eastman
> Attachments: MAHOUT-3.diff, MAHOUT-3a.diff
>
>
> I'd like to reserve some namespace, specifically
> org.apache.mahout.clustering.canopy to use for an initial prototype of canopy
> clustering. I'm going to start with a little unit test to get the basic
> algorithm sorted out, then M/R it.