That makes sense, though I don't understand why the reducer is not doing its job in the test you cite. I've had to do manual things (like calling close()) in the unit tests to get all of the functionality exercised. All of the clustering algorithms behave similarly: each cluster has a center (prior), which it uses to observe some of the data (observations) based upon a distance function (pdf); those observations are then used to compute its new centroid (posterior). I think it is possible to abstract them into a common framework using this model.
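
To make the prior/observations/posterior idea concrete, here is a rough sketch of what such a common abstraction might look like. It is not Mahout code: ClusterModel, EuclideanModel and ClusterIteration are hypothetical names, plain double[] stands in for Vector, and the hard "closest model wins" assignment is a k-means-style assumption (a fuzzy variant would weight each observation by its pdf instead).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical abstraction (not Mahout's API): a model with a prior center
// that observes points and then computes a posterior center.
interface ClusterModel {
    double pdf(double[] point);     // closeness of the point to the current center (prior)
    void observe(double[] point);   // accumulate one observation
    void computePosterior();        // replace the center with the centroid of the observations
    double[] center();
}

final class EuclideanModel implements ClusterModel {
    private double[] center;
    private final double[] sum;
    private int count;

    EuclideanModel(double[] center) {
        this.center = center.clone();
        this.sum = new double[center.length];
    }

    @Override
    public double pdf(double[] point) {
        double squared = 0.0;
        for (int i = 0; i < point.length; i++) {
            double d = point[i] - center[i];
            squared += d * d;
        }
        // Unnormalized score: larger means closer to the center.
        return 1.0 / (1.0 + Math.sqrt(squared));
    }

    @Override
    public void observe(double[] point) {
        for (int i = 0; i < point.length; i++) {
            sum[i] += point[i];
        }
        count++;
    }

    @Override
    public void computePosterior() {
        if (count == 0) {
            return; // nothing observed: keep the prior center
        }
        double[] next = new double[sum.length];
        for (int i = 0; i < sum.length; i++) {
            next[i] = sum[i] / count;
        }
        center = next;
        Arrays.fill(sum, 0.0); // reset the observations for the next iteration
        count = 0;
    }

    @Override
    public double[] center() {
        return center.clone();
    }
}

final class ClusterIteration {
    // One iteration: each point is observed by the model with the highest pdf,
    // then every model computes its posterior.
    static void iterate(List<? extends ClusterModel> models, List<double[]> points) {
        for (double[] point : points) {
            ClusterModel best = null;
            double bestPdf = -1.0;
            for (ClusterModel model : models) {
                double value = model.pdf(point);
                if (value > bestPdf) {
                    bestPdf = value;
                    best = model;
                }
            }
            if (best != null) {
                best.observe(point);
            }
        }
        for (ClusterModel model : models) {
            model.computePosterior();
        }
    }
}
```

Under these assumptions, the individual algorithms would differ mainly in how pdf() is defined and in how observations are weighted; the iteration loop itself would stay the same.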

Grant Ingersoll (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723067#action_12723067 ]
Grant Ingersoll commented on MAHOUT-137:
----------------------------------------

I see the problem now with KMeans (and likely Fuzzy KMeans), and it is a source of confusion. Namely, it's the whole relationship between Cluster.center and Cluster.centroid. It seems that as the Cluster goes from formatCluster through decodeCluster, the centroid (computed in formatCluster) becomes the center for the next time around. In testKMeansReducer, this never happens, since we aren't serializing through the string layer.
Obviously, I can correct this in the test, but it seems a bit strange.  AIUI, 
the center holds the current iteration center and it seems like the centroid is 
the result of where the center is being moved to, right?  This does indeed 
happen in my implementation of Writable, but since that isn't being called in 
the test, it doesn't occur.
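
The center/centroid relationship described above can be sketched roughly as follows. This is an illustration only, not the actual Mahout Cluster class: SketchCluster and recomputeCenter() are hypothetical names, and plain double[] stands in for Vector. The sketch makes the promotion of the centroid to the new center an explicit step rather than a side effect of a serialization round trip, which is one way a test that skips serialization could still exercise it.

```java
import java.util.Arrays;

// Hypothetical sketch (not Mahout's Cluster): the center is fixed during an
// iteration, while the centroid is derived from the points added so far.
final class SketchCluster {
    private double[] center;           // center used for the current iteration
    private final double[] pointTotal; // running sum of the points added
    private int numPoints;

    SketchCluster(double[] center) {
        this.center = center.clone();
        this.pointTotal = new double[center.length];
    }

    void addPoint(double[] point) {
        for (int i = 0; i < point.length; i++) {
            pointTotal[i] += point[i];
        }
        numPoints++;
    }

    // Where the center is being moved to; falls back to the center if no points were added.
    double[] computeCentroid() {
        if (numPoints == 0) {
            return center.clone();
        }
        double[] centroid = new double[pointTotal.length];
        for (int i = 0; i < pointTotal.length; i++) {
            centroid[i] = pointTotal[i] / numPoints;
        }
        return centroid;
    }

    // Explicit promotion: the centroid becomes the center for the next iteration.
    void recomputeCenter() {
        center = computeCentroid();
        Arrays.fill(pointTotal, 0.0);
        numPoints = 0;
    }

    double[] getCenter() {
        return center.clone();
    }
}
```

In this sketch a reducer test would call recomputeCenter() before asserting on the center, instead of depending on the encode/decode path to do the promotion.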

Convert Clustering Algs to use Vector Writable
----------------------------------------------

                Key: MAHOUT-137
                URL: https://issues.apache.org/jira/browse/MAHOUT-137
            Project: Mahout
         Issue Type: Improvement
           Reporter: Grant Ingersoll
           Assignee: Grant Ingersoll
            Fix For: 0.2

        Attachments: MAHOUT-137.patch, MAHOUT-137.patch, MAHOUT-137.patch, 
MAHOUT-137.patch, MAHOUT-137.patch


All M/R jobs should use Vector writable instead of encoding and decoding 
strings.  We can have a separate utility that converts serialized GSON, 
Strings, whatever into the appropriate vectors.  See MAHOUT-136 and 
http://www.lucidimagination.com/search/document/6a55f260826fd77f/jira_commented_mahout_136_change_canopy_mr_implementation_to_use_vector_writable

