RE: [jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by optimizing data transfer between map and reduce tasks

Palleti, Pallavi Wed, 01 Oct 2008 08:08:16 -0700

Yeah, I totally agree that a combiner should definitely be used when we are 
sure that it does not transform the data, as in word count.
But the current implementations of k-means and fuzzy k-means implicitly 
assume that a point goes through the combiner exactly once, because the data 
output by the mapper and the data output by the combiner are interpreted 
differently.


For example, assume that a point didn't pass through KMeansCombiner and 
reached the reducer directly. The reducer assumes that the value it receives 
is a partial sum plus a number of points, and tries to extract both from the 
value string. That fails, because the value is a single point and does not 
contain the number of points.

Now assume that the combiner ran more than once. It ran once over a few 
points and created a partial sum and a number of points. It then ran again 
over a new set of points together with the previous partial sum and number of 
points. Because the two kinds of data are interpreted differently (one is 
<vector, numpoints>, the others are just plain vectors), it fails.

To make it specific: in KMeans, we call the addPoint() method in the combiner 
and addPoints() in the reducer. We also pass different kinds of data: the 
mapper output contains only a point, whereas the combiner output contains a 
point and the number of points.
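
As a rough illustration of the mismatch described above, here is a small 
sketch. It is in Python rather than Mahout's Java, and the value encoding 
("count;x,y") and all function names are hypothetical, chosen only to show 
the idea; they are not Mahout's actual formats or methods.

```python
def parse_point(value):
    """Mapper-style value (hypothetical encoding): a plain point, "1.0,2.0"."""
    return [float(x) for x in value.split(",")]


def parse_partial(value):
    """Combiner-style value (hypothetical encoding): "count;sum", "2;4.0,6.0"."""
    count, vec = value.split(";")  # raises ValueError if there is no count
    return int(count), parse_point(vec)


def reducer_can_read(value):
    """The reducer expects combiner-style values only."""
    try:
        parse_partial(value)
        return True
    except ValueError:
        return False


def combiner_can_read(value):
    """The combiner expects mapper-style values only."""
    try:
        parse_point(value)
        return True
    except ValueError:
        return False


# Combiner skipped: a raw point reaches the reducer and cannot be parsed.
print(reducer_can_read("1.0,2.0"))
# Combiner run twice: its own earlier output comes back as input.
print(combiner_can_read("2;4.0,6.0"))
```

Both calls report a failure, one for each of the two cases: the combiner 
running zero times and the combiner running more than once.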

Personally, I hit the first issue: a point reached the reducer directly 
without going through the combiner, and the reducer failed to add the point 
because of a parsing error.

We can fix k-means by emitting <point,1> as the mapper's output value and 
using the addPoints() method in both the combiner and the reducer.
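
A minimal sketch of that fix, again in Python rather than Mahout's Java and 
with hypothetical names (map_point, add_points, centroid): once the mapper 
emits <point,1>, a single merge routine can serve as both combiner and 
reducer, and the result should not depend on how many times the combiner 
runs.

```python
def map_point(point):
    """Mapper: emit the trivial statistic <point, 1>."""
    return (list(point), 1)


def add_points(stats):
    """Merge (sum_vector, count) pairs; usable as combiner and as reducer."""
    stats = list(stats)
    total = [0.0] * len(stats[0][0])
    count = 0
    for vec, n in stats:
        total = [t + v for t, v in zip(total, vec)]
        count += n
    return (total, count)


def centroid(stat):
    """Final step in the reducer: divide the sum by the number of points."""
    total, count = stat
    return [t / count for t in total]


points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]
mapped = [map_point(p) for p in points]

# Combiner never runs: the reducer sees raw <point, 1> values.
no_combiner = add_points(mapped)
# Combiner runs once per split.
once = add_points([add_points(mapped[:2]), add_points(mapped[2:])])
# Combiner runs repeatedly, and one raw point still reaches the reducer.
partial = add_points([add_points(mapped[:2]), mapped[2]])
repeated = add_points([partial, mapped[3]])

print(centroid(no_combiner), centroid(once), centroid(repeated))
```

All three paths carry the same (sum, count) pair, so the centroid comes out 
the same in each case.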

But to make the current fuzzy k-means implementation work, I have to make 
sure that each point goes through the combiner exactly once, and add a check 
at the reducer for any point that didn't pass through the combiner.
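
For comparison, one candidate statistic for fuzzy k-means would be the pair 
(sum of u^m * x, sum of u^m), which is what the standard fuzzy c-means center 
update divides. The sketch below is only an assumption about how that could 
look, not what the current patch does; it is in Python, and map_fuzzy, merge 
and center are hypothetical names.

```python
def map_fuzzy(point, membership, m=2.0):
    """Mapper: emit <u^m * x, u^m> for one point and one cluster."""
    w = membership ** m
    return ([w * x for x in point], w)


def merge(stats):
    """Merge (weighted_sum, weight) pairs; usable as combiner and reducer."""
    stats = list(stats)
    total = [0.0] * len(stats[0][0])
    weight = 0.0
    for vec, w in stats:
        total = [t + v for t, v in zip(total, vec)]
        weight += w
    return (total, weight)


def center(stat):
    """Cluster center: weighted sum divided by total weight."""
    total, weight = stat
    return [t / weight for t in total]


# (point, membership of that point in one cluster); made-up values.
data = [((2.0, 4.0), 0.5), ((6.0, 8.0), 0.5), ((1.0, 1.0), 1.0)]
mapped = [map_fuzzy(p, u) for p, u in data]

direct = merge(mapped)                            # no combiner
combined = merge([mapped[0], merge(mapped[1:])])  # combiner on part of the data

print(center(direct), center(combined))
```

With values in this form, a point that skips the combiner and a partial 
result that passes through it several times are handled by the same code.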


Thanks
Pallavi




-----Original Message-----
From: Ted Dunning (JIRA) [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 01, 2008 7:04 PM
To: [email protected]
Subject: [jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by 
optimizing data transfer between map and reduce tasks


    [ https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636035#action_12636035 ]

Ted Dunning commented on MAHOUT-79:
-----------------------------------


The combiner should definitely be used if possible.  Generally this requires 
that there be some kind of sufficient statistic that summarizes partial 
results.  With k-means that means that you have to keep the number of points 
that have been combined and their sum.  With Gaussian mixture modeling you need 
to keep the sum and correlation matrix.  Presumably there should be some 
similar statistic that can be kept for fuzzy k-means.  The key characteristic 
of such sufficient statistics is that they can be combined 0 or more times 
without any problem.  Of course, the mapper has to put out individual points in 
the form of trivial statistics such as a count of one and a "sum" of one point.

Using a combiner can result in a decrease in the size of the data processed by 
the reducer by nearly the average size of each cluster.  This can easily be a 
factor of 100 or more.  Given that you made dramatic improvements in speed by 
moving less data to the reducer, this should have very significant effect.



> Improving the speed of Fuzzy K-Means by optimizing data transfer between map 
> and reduce tasks
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-79
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-79
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>         Attachments: FUZZY.patch
>
>
> Improve the speed of fuzzy k-Means by passing only the cluster-id info as key 
> output of mapper task and reading the cluster information in reducer task 
> where this info is needed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
