[ 
https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722275#action_12722275
 ] 

Grant Ingersoll commented on MAHOUT-136:
----------------------------------------

bq. I don't think so for Canopy, since canopies are never sent between map and 
reduce steps. The above commit sends only the centroid vector as Writable from 
the mapper. For reducer output, only the identifier is used for the key and the 
centroid for the value. The ClusterMapper no longer outputs the entire Canopy 
definition with each point; only the identifier. This is similar to what Kmeans 
does. 

That seems fine; Canopy as a Writable can do the same thing.  What I don't get 
is what is going on between the CanopyReducer and the ClusterMapper.  As I read 
the code, the CanopyDriver outputs each Canopy as a formatted string (canopy 
id, centroid), but the ClusterMapper seems to expect a Vector, and I'm not sure 
how the configure step plays into that.  If I upload what I have, can you take 
a look?
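
To make the Writable idea concrete, here is a minimal sketch of a canopy that 
serializes itself as (id, centroid) in binary.  Everything in it is an 
assumption for illustration: CanopySketch is a hypothetical stand-in, not 
Mahout's Canopy class; the write/readFields pair only mirrors the shape of 
Hadoop's Writable contract using plain java.io so it runs without Hadoop on the 
classpath; and asText is a made-up text form, not the real asFormatString 
output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a canopy; NOT Mahout's actual Canopy class. */
class CanopySketch {
    private int id;
    private double[] centroid;

    CanopySketch() {
        this(0, new double[0]);
    }

    CanopySketch(int id, double[] centroid) {
        this.id = id;
        this.centroid = centroid.clone();
    }

    int getId() {
        return id;
    }

    double[] getCentroid() {
        return centroid.clone();
    }

    /** Same shape as Hadoop's Writable.write(DataOutput): id, length, values. */
    void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeInt(centroid.length);
        for (double v : centroid) {
            out.writeDouble(v);
        }
    }

    /** Same shape as Hadoop's Writable.readFields(DataInput). */
    void readFields(DataInput in) throws IOException {
        id = in.readInt();
        int n = in.readInt();
        centroid = new double[n];
        for (int i = 0; i < n; i++) {
            centroid[i] = in.readDouble();
        }
    }

    /** Assumed text form, used only to compare sizes with the binary form. */
    String asText() {
        return "C" + id + ": " + Arrays.toString(centroid);
    }

    /** Serializes via write(); wraps IOException since the buffer is in memory. */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds a canopy from bytes produced by toBytes(). */
    static CanopySketch fromBytes(byte[] bytes) {
        CanopySketch canopy = new CanopySketch();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            canopy.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return canopy;
    }

    public static void main(String[] args) {
        CanopySketch canopy = new CanopySketch(7, new double[] {0.123456789, 1.0 / 3.0, 2.5});
        byte[] binary = canopy.toBytes();
        CanopySketch copy = fromBytes(binary);
        System.out.println("round trip ok: "
            + (copy.getId() == 7 && Arrays.equals(copy.getCentroid(), canopy.getCentroid())));
        System.out.println("binary bytes: " + binary.length);
        System.out.println("text bytes:   "
            + canopy.asText().getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The binary form costs a fixed 8 bytes per component plus 8 bytes of header, 
and the reader gets back exactly the doubles that were written without any 
string parsing.  How much that saves over a text form depends on how many 
digits the values print with, so treat the size comparison as indicative only.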

> Change Canopy MR Implementation to use Vector Writable
> ------------------------------------------------------
>
>                 Key: MAHOUT-136
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-136
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.2
>
>
> Internal serialization of Canopy currently uses asFormatString rather than 
> just making the Canopy writable. This is storage inefficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
