[
https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636035#action_12636035
]
Ted Dunning commented on MAHOUT-79:
-----------------------------------
The combiner should definitely be used if possible. Generally this requires
that there be some kind of sufficient statistic that summarizes partial
results. With k-means, that means keeping the number of points that have been
combined and their sum. With Gaussian mixture modeling, you need to keep the
total weight, the weighted sum, and the correlation matrix (the weighted sum
of outer products). Presumably there is a similar statistic that can be kept
for fuzzy k-means. The key characteristic of such sufficient statistics is
that they can be combined zero or more times without changing the result,
which matters because the framework does not guarantee how often the combiner
runs. Of course, the mapper therefore has to emit individual points in the
form of trivial statistics, such as a count of one and a "sum" of one point.
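
To make the k-means case concrete, here is a minimal sketch in plain Python
rather than Hadoop or Mahout code. The class name PartialCluster and its
methods are hypothetical and chosen only for illustration. It shows that the
(count, sum) pair gives the same centroid whether the combiner runs zero
times or once per half of the data.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class PartialCluster:
    """Hypothetical sufficient statistic for one k-means cluster."""
    count: int
    total: tuple  # per-dimension sum of the points combined so far

    @staticmethod
    def from_point(point):
        # What the mapper emits: a count of one and a "sum" of one point.
        return PartialCluster(1, tuple(float(x) for x in point))

    def combine(self, other):
        # Addition is associative and commutative, so this step can be
        # applied any number of times, in any grouping.
        return PartialCluster(
            self.count + other.count,
            tuple(a + b for a, b in zip(self.total, other.total)),
        )

    def centroid(self):
        return tuple(s / self.count for s in self.total)


# Four points assigned to the same cluster.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]
emitted = [PartialCluster.from_point(p) for p in points]

# Case 1: the combiner never ran, the reducer merges raw mapper output.
no_combiner = reduce(PartialCluster.combine, emitted)

# Case 2: a combiner merged each half, the reducer merges two partials.
left = reduce(PartialCluster.combine, emitted[:2])
right = reduce(PartialCluster.combine, emitted[2:])
with_combiner = left.combine(right)

print(no_combiner == with_combiner)
print(with_combiner.centroid())
```

Because the reducer applies the same combine step as the combiner, it does
not need to know whether the records it receives are raw points or partial
results.
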
Using a combiner can reduce the amount of data processed by the reducer by a
factor of nearly the average cluster size, that is, the average number of
points per cluster. This can easily be a factor of 100 or more. Given that
you made dramatic improvements in speed by moving less data to the reducer,
this should have a very significant effect.
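
For the fuzzy case, one candidate statistic follows from the standard fuzzy
k-means center update, c_j = sum_i(u_ij^m * x_i) / sum_i(u_ij^m): keep the
sum of the weights u^m and the weighted sum of the points. The sketch below
is plain Python under that assumption, not the code in FUZZY.patch; the
function names, the fixed memberships, and the two simulated map tasks are
all made up for illustration. It also counts the records that would reach the
reducer with and without a combiner.

```python
import math


def map_point(point, memberships, m=2.0):
    """Emit (cluster_id, (weight, weighted_point)) for one point.

    `memberships` maps a cluster id to the point's membership in that
    cluster; computing the memberships is outside this sketch.
    """
    for cluster_id, u in memberships.items():
        w = u ** m
        yield cluster_id, (w, tuple(w * x for x in point))


def combine(records):
    """Sum partial statistics per cluster id (combiner and reducer logic)."""
    totals = {}
    for cluster_id, (w, wx) in records:
        if cluster_id in totals:
            w0, wx0 = totals[cluster_id]
            totals[cluster_id] = (w0 + w, tuple(a + b for a, b in zip(wx0, wx)))
        else:
            totals[cluster_id] = (w, wx)
    return list(totals.items())


def centers(records):
    """Turn the summed statistics into one center per cluster."""
    return {cid: tuple(s / w for s in wx) for cid, (w, wx) in combine(records)}


# 200 synthetic points, each with the same (made-up) memberships.
points = [(float(i), float(2 * i)) for i in range(200)]
memberships = {"a": 0.25, "b": 0.75}
mapper_output = [rec for p in points for rec in map_point(p, memberships)]

# Pretend the records came from two map tasks, each running the combiner once.
half = len(mapper_output) // 2
shuffled = combine(mapper_output[:half]) + combine(mapper_output[half:])

print(len(mapper_output), "records without a combiner")
print(len(shuffled), "records with a combiner")

direct = centers(mapper_output)
combined = centers(shuffled)
```

In this toy run each map task sends one record per cluster instead of one
record per point per cluster, and the centers computed from the combined
records agree with those computed from the raw mapper output up to
floating-point rounding.
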
> Improving the speed of Fuzzy K-Means by optimizing data transfer between map
> and reduce tasks
> ---------------------------------------------------------------------------------------------
>
> Key: MAHOUT-79
> URL: https://issues.apache.org/jira/browse/MAHOUT-79
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Reporter: Pallavi Palleti
> Attachments: FUZZY.patch
>
>
> Improve the speed of fuzzy k-Means by passing only the cluster-id info as key
> output of mapper task and reading the cluster information in reducer task
> where this info is needed.