Yes, I agree that a combiner should definitely be used whenever we are sure
that it does not transform the data in any way. Word count is one such
example.
But the current implementations of k-means and fuzzy k-means implicitly
assume that every point goes through the combiner exactly once, because we
interpret the data output by the combiner differently from the data output by
the mapper.
For example, suppose a point does not pass through KMeansCombiner and reaches
the reducer directly. The reducer assumes that the value it receives is a
partial sum plus a number of points, and tries to extract both from the value
string. The extraction fails because the value is a single point and does not
contain the number of points.
Now suppose the combiner runs more than once. It runs once over a few points
and produces a partial sum and a number of points. It then runs again over a
new set of points together with the earlier partial sum and number of points.
Because the inputs are now interpreted in two different ways (one is
<vector, numpoints>, the others are just plain vectors), it fails.
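As a rough sketch of the first failure: the value encoding below is made up
for illustration (a count, a tab, then comma-separated components) and is not
the actual format used in Mahout. The class and method names are hypothetical
as well. The point is only that a parser written for combiner output breaks on
a bare point coming straight from the mapper.

```java
public class ValueFormatSketch {
    // Hypothetical combiner output: "<numPoints>\t<v1,v2,...>" (not Mahout's real format).
    static String encodePartial(int numPoints, double[] sum) {
        StringBuilder sb = new StringBuilder().append(numPoints).append('\t');
        for (int i = 0; i < sum.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(sum[i]);
        }
        return sb.toString();
    }

    // Hypothetical mapper output: just "<v1,v2,...>" with no count in front.
    static String encodePoint(double[] point) {
        String withCount = encodePartial(0, point);
        return withCount.substring(withCount.indexOf('\t') + 1);
    }

    // Reducer-side parsing that assumes every value came from the combiner.
    static int parseNumPoints(String value) {
        int tab = value.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no point count in value: " + value);
        }
        return Integer.parseInt(value.substring(0, tab));
    }

    public static void main(String[] args) {
        String fromCombiner = encodePartial(3, new double[] {6.0, 9.0});
        String fromMapper = encodePoint(new double[] {2.0, 3.0});

        System.out.println(parseNumPoints(fromCombiner)); // prints 3
        try {
            parseNumPoints(fromMapper); // value skipped the combiner
        } catch (IllegalArgumentException e) {
            System.out.println("reducer failed: " + e.getMessage());
        }
    }
}
```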
To be specific, in KMeans we call the addPoint() method in the combiner and
addPoints() in the reducer.
We also pass different kinds of data: mapper output contains only the point,
whereas combiner output contains a partial sum and the number of points.
I personally ran into the first issue: a point reached the reducer directly
without going through the combiner, and the reducer failed to add the point
because of a parsing error.
We can fix k-means by having the mapper emit <point,1> as its output value and
by using the addPoints method in the combiner as well as in the reducer.
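A minimal sketch of that idea, in plain Java rather than against the Hadoop or
Mahout APIs: the Partial class and its methods are hypothetical names for
illustration, not existing Mahout code. The mapper output is treated as a
partial sum with a count of one, and the same merge step is used everywhere,
so it does not matter whether the combiner runs zero, one, or several times.

```java
import java.util.Arrays;

public class PartialSumSketch {
    // Per-cluster statistic: vector sum plus the number of points it covers.
    static final class Partial {
        final double[] sum;
        final long numPoints;

        Partial(double[] sum, long numPoints) {
            this.sum = sum.clone();
            this.numPoints = numPoints;
        }

        // What the mapper would emit for a single point: <point, 1>.
        static Partial ofPoint(double[] point) {
            return new Partial(point, 1);
        }

        // The same merge is used by combiner and reducer (assumes equal dimensions).
        Partial merge(Partial other) {
            double[] s = sum.clone();
            for (int i = 0; i < s.length; i++) {
                s[i] += other.sum[i];
            }
            return new Partial(s, numPoints + other.numPoints);
        }

        double[] centroid() {
            double[] c = sum.clone();
            for (int i = 0; i < c.length; i++) {
                c[i] /= numPoints;
            }
            return c;
        }
    }

    static Partial mergeAll(Partial... parts) {
        Partial acc = parts[0];
        for (int i = 1; i < parts.length; i++) {
            acc = acc.merge(parts[i]);
        }
        return acc;
    }

    public static void main(String[] args) {
        Partial a = Partial.ofPoint(new double[] {1, 2});
        Partial b = Partial.ofPoint(new double[] {3, 4});
        Partial c = Partial.ofPoint(new double[] {5, 6});

        // Combiner skipped: the reducer merges three <point,1> values.
        double[] direct = mergeAll(a, b, c).centroid();
        // Combiner ran on (a, b) first; its output is merged with c later.
        double[] combined = mergeAll(mergeAll(a, b), c).centroid();

        System.out.println(Arrays.toString(direct));   // prints [3.0, 4.0]
        System.out.println(Arrays.toString(combined)); // prints [3.0, 4.0]
    }
}
```

Both paths give the same centroid because every value has the same
<sum, count> form and the merge is just addition.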
But to make this work for the current fuzzy k-means implementation, I have to
make sure that each point goes through the combiner exactly once, and add a
check in the reducer to detect any point that did not pass through the
combiner.
Thanks
Pallavi
-----Original Message-----
From: Ted Dunning (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 01, 2008 7:04 PM
To: [email protected]
Subject: [jira] Commented: (MAHOUT-79) Improving the speed of Fuzzy K-Means by
optimizing data transfer between map and reduce tasks
[
https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12636035#action_12636035
]
Ted Dunning commented on MAHOUT-79:
-----------------------------------
The combiner should definitely be used if possible. Generally this requires
that there be some kind of sufficient statistic that summarizes partial
results. With k-means that means that you have to keep the number of points
that have been combined and their sum. With Gaussian mixture modeling you need
to keep the sum and correlation matrix. Presumably there should be some
similar statistic that can be kept for fuzzy k-means. The key characteristic
of such sufficient statistics is that they can be combined 0 or more times
without any problem. Of course, the mapper has to put out individual points in
the form of trivial statistics such as a count of one and a "sum" of one point.
Using a combiner can result in a decrease in the size of the data processed by
the reducer by nearly the average size of each cluster. This can easily be a
factor of 100 or more. Given that you made dramatic improvements in speed by
moving less data to the reducer, this should have very significant effect.
> Improving the speed of Fuzzy K-Means by optimizing data transfer between map
> and reduce tasks
> ---------------------------------------------------------------------------------------------
>
> Key: MAHOUT-79
> URL: https://issues.apache.org/jira/browse/MAHOUT-79
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Reporter: Pallavi Palleti
> Attachments: FUZZY.patch
>
>
> Improve the speed of fuzzy k-Means by passing only the cluster-id info as key
> output of mapper task and reading the cluster information in reducer task
> where this info is needed.