[ https://issues.apache.org/jira/browse/MAHOUT-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722244#action_12722244 ]
Jeff Eastman commented on MAHOUT-136:
-------------------------------------
I don't think so for Canopy, since canopies are never sent between map and
reduce steps. The above commit sends only the centroid vector as a Writable
from the mapper. For reducer output, only the identifier is used for the key
and the centroid for the value. The ClusterMapper no longer outputs the entire
Canopy definition with each point; it outputs only the identifier. This is
similar to what Kmeans does.
Kmeans clusters, OTOH, do need to be serialized in order to save their state
between iterations. The centroids and identifiers need to be saved together, so
making them Writable is an option. I was going to create a new issue to do that,
but MAHOUT-137 already covers it. It is still not clear to me whether to use
Writable or JSON format for these inter-iteration results, since the number of
clusters will typically be small and readability probably trumps density.
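
To make the Writable option concrete, here is a minimal sketch of saving an
identifier and centroid together in binary form. It is only an illustration:
ClusterState and its fields are hypothetical names, not Mahout classes, and the
write/readFields methods merely mirror the shape of Hadoop's Writable contract
using plain java.io so the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical class for illustration; not Mahout's Cluster or Canopy.
class ClusterState {
  int id;             // cluster identifier
  double[] centroid;  // dense centroid values (assumed dense for simplicity)

  ClusterState() {
    this(0, new double[0]);
  }

  ClusterState(int id, double[] centroid) {
    this.id = id;
    this.centroid = centroid;
  }

  // Same signature shape as Hadoop's Writable.write(DataOutput).
  void write(DataOutput out) throws IOException {
    out.writeInt(id);
    out.writeInt(centroid.length);
    for (double v : centroid) {
      out.writeDouble(v);
    }
  }

  // Same signature shape as Hadoop's Writable.readFields(DataInput).
  void readFields(DataInput in) throws IOException {
    id = in.readInt();
    centroid = new double[in.readInt()];
    for (int i = 0; i < centroid.length; i++) {
      centroid[i] = in.readDouble();
    }
  }

  // Helper: serialize to a byte array, wrapping the checked exception.
  static byte[] toBytes(ClusterState c) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      c.write(new DataOutputStream(buf));
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Helper: rebuild a ClusterState from bytes produced by toBytes.
  static ClusterState fromBytes(byte[] bytes) {
    try {
      ClusterState c = new ClusterState();
      c.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
      return c;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    ClusterState saved = new ClusterState(7, new double[] {1.0, 2.5});
    byte[] bytes = toBytes(saved);
    ClusterState restored = fromBytes(bytes);
    System.out.println(restored.id + " " + Arrays.toString(restored.centroid)
        + " in " + bytes.length + " bytes");
  }
}
```

The binary form is compact (two ints plus eight bytes per dimension) but opaque
to a human reader, which is the density-versus-readability tradeoff mentioned
above.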
> Change Canopy MR Implementation to use Vector Writable
> ------------------------------------------------------
>
> Key: MAHOUT-136
> URL: https://issues.apache.org/jira/browse/MAHOUT-136
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.1
> Reporter: Jeff Eastman
> Assignee: Jeff Eastman
> Fix For: 0.2
>
>
> Internal serialization of Canopy currently uses asFormatString rather than
> just making the Canopy Writable. This is storage-inefficient.