[ 
https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722244#action_12722244
 ] 

Jeff Eastman commented on MAHOUT-136:
-------------------------------------

I don't think so for Canopy, since canopies are never sent between map and 
reduce steps. The above commit sends only the centroid vector as Writable from 
the mapper. For reducer output, only the identifier is used for the key and the 
centroid for the value. The ClusterMapper no longer outputs the entire Canopy 
definition with each point; only the identifier. This is similar to what Kmeans 
does. 

Kmeans clusters, OTOH, do need to be serialized in order to save their state 
between iterations. The centroids and identifiers need to be saved together, so 
making them Writable is an option. I was going to create a new issue to do that, 
but MAHOUT-137 already covers it. It is still not clear to me whether to use 
Writable or JSON format for these inter-iteration results, since the number of 
clusters will typically be small and readability probably trumps density.
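
To make the Writable option concrete, here is a minimal sketch of saving an 
identifier and a centroid together in binary form. ClusterState and its 
toBytes/fromBytes helpers are hypothetical names used only for illustration, not 
Mahout's actual cluster classes. The write/readFields pair only mirrors the 
shape of Hadoop's Writable interface, so the example runs without Hadoop on the 
classpath, and a plain double[] stands in for a Mahout Vector.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a cluster's inter-iteration state.
// Assumption: the state is just an integer identifier plus a dense centroid.
class ClusterState {
    int id;
    double[] centroid;

    ClusterState() {
        this(0, new double[0]);
    }

    ClusterState(int id, double[] centroid) {
        this.id = id;
        this.centroid = centroid;
    }

    // Same signature shape as Hadoop's Writable.write(DataOutput).
    void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeInt(centroid.length);
        for (double v : centroid) {
            out.writeDouble(v);
        }
    }

    // Same signature shape as Hadoop's Writable.readFields(DataInput).
    void readFields(DataInput in) throws IOException {
        id = in.readInt();
        int n = in.readInt();
        centroid = new double[n];
        for (int i = 0; i < n; i++) {
            centroid[i] = in.readDouble();
        }
    }

    // Convenience helper: serialize one cluster to a byte array.
    static byte[] toBytes(ClusterState c) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            c.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Convenience helper: rebuild a cluster from bytes produced by toBytes.
    static ClusterState fromBytes(byte[] bytes) {
        ClusterState c = new ClusterState();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            c.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return c;
    }
}
```

In this layout a cluster with a 3-element centroid takes 32 bytes (two 4-byte 
ints plus three 8-byte doubles). That is compact but opaque, which is the 
density versus readability tradeoff mentioned above.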

> Change Canopy MR Implementation to use Vector Writable
> ------------------------------------------------------
>
>                 Key: MAHOUT-136
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-136
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.1
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.2
>
>
> Internal serialization of Canopy currently uses asFormatString rather than 
> just making the Canopy writable. This is storage inefficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
