[ 
https://issues.apache.org/jira/browse/MAHOUT-3?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Eastman updated MAHOUT-3:
------------------------------

    Attachment: MAHOUT-3a.diff

Initial implementation of Canopy generation phase of two-phase Canopy
Clustering algorithm. See unit tests for the evolution of the user
stories leading to the working implementation.
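
For readers unfamiliar with the canopy generation phase, here is a
minimal sketch of the published single-machine algorithm. It is not the
code in the attached diff. The class name CanopySketch, the method names,
the use of double[] for points, and the threshold names t1 and t2 (loose
and tight, with t1 > t2) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the reference canopy generation loop; not taken from the patch.
class CanopySketch {

  /** Manhattan (L1) distance between two points of equal length. */
  static double manhattan(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);
    }
    return sum;
  }

  /**
   * Returns one list of member points per canopy.
   * t1 is the loose threshold and t2 the tight threshold (assumed t1 > t2).
   */
  static List<List<double[]>> createCanopies(List<double[]> input, double t1, double t2) {
    List<double[]> points = new ArrayList<>(input); // work on a copy of the input
    List<List<double[]>> canopies = new ArrayList<>();
    while (!points.isEmpty()) {
      // The next remaining point becomes the center of a new canopy.
      double[] center = points.remove(0);
      List<double[]> canopy = new ArrayList<>();
      canopy.add(center);
      Iterator<double[]> it = points.iterator();
      while (it.hasNext()) {
        double[] p = it.next();
        double d = manhattan(center, p);
        if (d < t1) {
          canopy.add(p);   // loosely covered: the point joins this canopy
        }
        if (d < t2) {
          it.remove();     // tightly covered: the point cannot seed another canopy
        }
      }
      canopies.add(canopy);
    }
    return canopies;
  }
}
```

A point that lies between t2 and t1 of a center stays in the working
list, so it may belong to more than one canopy. That overlap is what
distinguishes canopies from a hard partition.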

TODO: Implement the actual clustering of the original points using
the canopy centers produced by this implementation.

TODO: Sort out the generics

TODO: Allow points to be sparse, to carry payloads for use by other 
subsystems, ...

All unit tests run.

- src/main/java/org/apache/mahout/clustering/canopy
  - Canopy.java
    (addPointToCanopies): applies the distance metric to all canopies,
      adding the point to those that cover it
    (getCentroid): returns the initial centroid
    (getNumPoints): returns the number of points added
    (computeCentroid): normalizes the pointTotals with the numPoints
      to return a computed centroid for the canopy
    (ptOut, toString, formatPoint): utilities
  - CanopyDriver.java
    (main): the main program
    (runJob): static method used by unit tests
  - CanopyMapper.java
    (map): the map function assigns points to canopies
    (config): configuration provided for unit tests
    (configure): reads distance measure and threshold from job
    (close): writes the canopy centroids to the output
  - CanopyReducer.java
    (reduce): the reduce function assigns points to canopies
    (config): configuration provided for unit tests
    (configure): reads distance measure and threshold from job
    (close): writes the canopy centroids to the output
  - DistanceMeasure.java
    (distance): computes the distance between two points by some measure
  - EuclideanDistanceMeasure.java
    (distance): computes the distance between two points by Euclidean measure
  - ManhattanDistanceMeasure.java
    (distance): computes the distance between two points by Manhattan measure
- src/test/java/org/apache/mahout/clustering/canopy
  - DummyOutputCollector.java
    (collect): collects output data
    (getData): returns output data for unit tests
  - TestCanopy.java
    (addPoint): overrides Canopy method to add point to a list
    (toString): overrides Canopy method to add point printout
  - TestCanopyCreation.java
    (setUp): uses published algorithm to initialize reference data
    (testReferenceManhattan, testReferenceEuclidean): validates reference data
    (testIterativeManhattan, testIterativeEuclidean): uses optimized
      algorithm and verifies result vs. reference data
    (testCanopyMapperManhattan, testCanopyMapperEuclidean,
     testCanopyReducerManhattan, testCanopyReducerEuclidean): exercises
      mapper and reducer with test data
    (testManhattanMR, testEuclideanMR): runs Hadoop jobs and verifies
      resulting canopy centroids
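
To illustrate the distance measures and the centroid computation listed
above, here is a minimal sketch. The interface and class names
(DistanceMeasureSketch, EuclideanSketch, ManhattanSketch, CentroidSketch),
the double[] point representation, and all method signatures are
assumptions; the signatures in the attached diff may differ. Only the
names pointTotals, numPoints, and computeCentroid come from the
description above.

```java
// Hypothetical sketch; not the code in the attached patch.
interface DistanceMeasureSketch {
  /** Distance between two points of equal length. */
  double distance(double[] p1, double[] p2);
}

class EuclideanSketch implements DistanceMeasureSketch {
  public double distance(double[] p1, double[] p2) {
    double sum = 0.0;
    for (int i = 0; i < p1.length; i++) {
      double d = p1[i] - p2[i];
      sum += d * d;            // accumulate squared differences
    }
    return Math.sqrt(sum);
  }
}

class ManhattanSketch implements DistanceMeasureSketch {
  public double distance(double[] p1, double[] p2) {
    double sum = 0.0;
    for (int i = 0; i < p1.length; i++) {
      sum += Math.abs(p1[i] - p2[i]); // accumulate absolute differences
    }
    return sum;
  }
}

class CentroidSketch {
  private final double[] pointTotals; // per-dimension running sums
  private int numPoints;              // number of points added so far

  CentroidSketch(int dimensions) {
    this.pointTotals = new double[dimensions];
  }

  void addPoint(double[] p) {
    for (int i = 0; i < pointTotals.length; i++) {
      pointTotals[i] += p[i];
    }
    numPoints++;
  }

  /** Divides each total by numPoints; assumes at least one point was added. */
  double[] computeCentroid() {
    double[] centroid = new double[pointTotals.length];
    for (int i = 0; i < centroid.length; i++) {
      centroid[i] = pointTotals[i] / numPoints;
    }
    return centroid;
  }
}
```

Keeping running totals rather than the member points means the centroid
can be computed without storing every point, which is presumably what
makes this suitable for a mapper or reducer.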

> Build initial canopy clustering prototype
> -----------------------------------------
>
>                 Key: MAHOUT-3
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-3
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Jeff Eastman
>         Attachments: MAHOUT-3.diff, MAHOUT-3a.diff
>
>
> I'd like to reserve some namespace, specifically 
> org.apache.mahout.clustering.canopy to use for an initial prototype of canopy 
> clustering. I'm going to start with a little unit test to get the basic 
> algorithm sorted out, then M/R it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
