[
https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved MAHOUT-357.
------------------------------
Resolution: Won't Fix
Seems like a thread that died during GSoC discussion.
> Implement a clustering algorithm on mapreduce
> ---------------------------------------------
>
> Key: MAHOUT-357
> URL: https://issues.apache.org/jira/browse/MAHOUT-357
> Project: Mahout
> Issue Type: New Feature
> Reporter: new user
>
> As I mentioned in my previous posts that I am interseted in implementing the
> clustering algorithm on mapreduce.So,now I am going to tell what I have
> thought to implement this. Thinking of the k-means algorithm for
> clustering,it appears that the whole set of data has to copied on each of the
> nodes of the hadoop framework to process the data in each iteration of the
> k-means clustering. But, this can be done without useless replication of
> data on the clusters.First of all, we select a set of k elements as the
> initial clusters.This can be purely random or decided on the basis of some
> criteria.We maintain a file which stores the id of each cluster, the number
> of elements in each cluster, and the exact position of the cluster in terms
> of its co-ordinates.This file has to be shared by each of the nodes. During
> each iteration of the algorithm, the following steps are done:
> 1. As each node has a part of the initial data,during the map phase, it
> calculates the distance of each of the element from the k cluster centroids.
> For each row,the smallest distance is chosen and the id of the cluster and
> the position of that element is stored.
> 2.During the combine phase, for each of the cluster, the average of the
> co-ordinates for all the elements is calculated and the number of elements in
> that cluster. So, the combiner funnction outputs the cluster id and the
> average co-ordinates of the elements.
> 3. During the reduce phase, the cluster centroid is re-calculated using the
> weighted averages of the co-ordinates.
> Thus , after these 3 steps, the new value of centorid for each cluster and
> the number of elemnts in each cluster is updated.
> The above three steps can be performed iteratively as long as the condition
> set for the convergence is not satisfied, by applying the map-combine-reduce
> phases again.
> I have proposed this as per my understanding of the probelem and my
> knowledge. If anybody have any doubts or want to add anything or suggest
> anything anything,then please respond as soon as possible. And, if you
> consider it a good idea, then please suggest how to proceed further in the
> Gsoc procedure.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.