[jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks

Ted Dunning (JIRA) Wed, 01 Oct 2008 06:34:05 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636035#action_12636035
 ]


Ted Dunning commented on MAHOUT-79:
-----------------------------------


The combiner should definitely be used if possible.  Generally this requires 
that there be some kind of sufficient statistic that summarizes partial 
results.  With k-means, that means keeping the number of points that have been 
combined and their sum.  With Gaussian mixture modeling you need to keep the 
weighted count, the sum, and the correlation matrix (the sum of outer 
products).  Presumably there is a similar statistic that can be kept for fuzzy 
k-means.  The key characteristic of such sufficient statistics is that they can 
be combined zero or more times without changing the final result.  Of course, 
the mapper has to emit each individual point as a trivial statistic: a count of 
one and a "sum" consisting of that one point.
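
A minimal sketch of such a statistic, written in plain Python rather than as 
Hadoop code.  The names PartialStat, of_point, merge and combine are 
hypothetical and are not Mahout classes.  The fuzzy k-means usage (weighting 
each point by its membership raised to the fuzziness exponent) is an assumption 
based on the standard fuzzy c-means center update, not something confirmed in 
this issue.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class PartialStat:
    """Sufficient statistic for one cluster: total weight and weighted sum."""
    weight: float
    vector_sum: Tuple[float, ...]

    @staticmethod
    def of_point(point: Sequence[float], weight: float = 1.0) -> "PartialStat":
        # Mapper output: the trivial statistic for a single point.
        # Plain k-means uses weight 1 (a count of one).  For fuzzy k-means
        # the weight would be membership ** m (assumption, see lead-in).
        return PartialStat(weight, tuple(weight * x for x in point))

    def merge(self, other: "PartialStat") -> "PartialStat":
        # Adding weights and sums is associative and commutative, which is
        # what makes it safe to apply in a combiner any number of times.
        summed = tuple(a + b for a, b in zip(self.vector_sum, other.vector_sum))
        return PartialStat(self.weight + other.weight, summed)

    def centroid(self) -> Tuple[float, ...]:
        return tuple(s / self.weight for s in self.vector_sum)


def combine(stats: Iterable[PartialStat]) -> PartialStat:
    """Shared by combiner and reducer: fold one or more partial statistics."""
    stats = list(stats)
    result = stats[0]
    for s in stats[1:]:
        result = result.merge(s)
    return result


points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
trivial = [PartialStat.of_point(p) for p in points]

# No combiner: the reducer sees every trivial statistic.
direct = combine(trivial)
# Combiner ran once on the first two points before the reducer.
staged = combine([combine(trivial[:2]), trivial[2]])

print(direct.centroid())  # (3.0, 4.0)
print(staged == direct)   # True
```

Because the reducer and the combiner run the same fold, the result does not 
depend on whether the framework chooses to run the combiner at all.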

Using a combiner can shrink the data processed by the reducer by a factor close 
to the average number of points per cluster.  This can easily be a factor of 
100 or more.  Given that you saw dramatic improvements in speed by moving less 
data to the reducer, this should have a very significant effect.
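
To make the arithmetic concrete, here is a small simulation of one map task's 
output with and without a combiner.  The split size (1,000 points), the cluster 
count (10), the one-dimensional points and the function names are all made up 
for illustration; real ratios depend on how points are spread across map tasks.

```python
from collections import defaultdict


def map_outputs(assignments):
    """One (cluster_id, (count, sum)) record per point, as a mapper would emit."""
    return [(cluster_id, (1, x)) for cluster_id, x in assignments]


def run_combiner(records):
    """Merge all records that share a cluster id into one (count, sum) record."""
    merged = defaultdict(lambda: (0, 0.0))
    for cluster_id, (count, total) in records:
        count0, total0 = merged[cluster_id]
        merged[cluster_id] = (count0 + count, total0 + total)
    return sorted(merged.items())


# Hypothetical map split: 1,000 points assigned round-robin to 10 clusters.
assignments = [(i % 10, float(i)) for i in range(1000)]

before = map_outputs(assignments)   # records sent to the reducer, no combiner
after = run_combiner(before)        # records sent to the reducer, with combiner

reduction = len(before) / len(after)
print(len(before), len(after), reduction)  # 1000 10 100.0
```

Here each cluster receives 100 points from the split, so the combiner sends 
one record per cluster instead of 100.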



> Improving the speed of Fuzzy K-Means by optimizing data transfer between map 
> and reduce tasks
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-79
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-79
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>         Attachments: FUZZY.patch
>
>
> Improve the speed of fuzzy k-Means by passing only the cluster-id info as key 
> output of mapper task and reading the cluster information in reducer task 
> where this info is needed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
