[ https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852732#action_12852732 ]
new user commented on MAHOUT-357:
---------------------------------

If that is so, can you elaborate on exactly what you want to achieve with a new implementation? That is, should I optimize this algorithm, or think about an entirely new way of implementing it? Please be a little more descriptive.

> Implement a clustering algorithm on mapreduce
> ---------------------------------------------
>
>                 Key: MAHOUT-357
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-357
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: new user
>
> As I mentioned in my previous posts, I am interested in implementing a
> clustering algorithm on MapReduce, so here is what I have in mind.
> Thinking of the k-means algorithm, it appears at first that the whole data
> set has to be copied to each node of the Hadoop cluster to process the
> data in each iteration of k-means. However, this can be done without
> needless replication of data across the cluster. First, we select a set of
> k elements as the initial clusters; this choice can be purely random or
> based on some criterion. We maintain a file that stores, for each cluster,
> its id, the number of elements it contains, and the exact position of its
> centroid in terms of coordinates. This file is shared by every node.
> During each iteration of the algorithm, the following steps are performed:
>
> 1. Since each node holds a part of the input data, during the map phase it
> calculates the distance of each of its elements from the k cluster
> centroids. For each row, the smallest distance is chosen, and the cluster
> id and the position of that element are emitted.
> 2. During the combine phase, for each cluster, the average of the
> coordinates of all its elements and the number of elements in that cluster
> are calculated. The combiner thus outputs the cluster id and the average
> coordinates of the elements.
> 3. During the reduce phase, the cluster centroid is recalculated using the
> weighted averages of the coordinates.
>
> Thus, after these three steps, the new centroid of each cluster and the
> number of elements in each cluster are updated. The three steps can be
> repeated, applying the map-combine-reduce phases again, as long as the
> convergence condition is not satisfied.
>
> I have proposed this based on my understanding of the problem and my
> knowledge. If anybody has any doubts, or wants to add or suggest anything,
> then please respond as soon as possible. And if you consider it a good
> idea, then please suggest how to proceed further in the GSoC process.
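The map/combine/reduce steps quoted above can be sketched in plain Java. This is a hedged sketch of the per-iteration logic only: the class and method names are illustrative, not Mahout's actual API, and all Hadoop job wiring is omitted. One small design note: emitting per-cluster coordinate sums together with element counts (rather than bare averages) keeps the reducer's weighted average exact.

```java
import java.util.*;

// One k-means iteration sketched as map / combine / reduce steps.
// Illustrative names only; no Hadoop dependency, just the per-record
// logic each phase would run.
class KMeansIterationSketch {

    // Map phase: find the nearest centroid for one point
    // (squared Euclidean distance is enough for comparison).
    static int nearestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // Combine phase: for one node's split of the data, accumulate
    // per-cluster coordinate sums and element counts.
    // Result maps clusterId -> [sum_0, ..., sum_{dim-1}, count].
    static Map<Integer, double[]> combine(List<double[]> split,
                                          double[][] centroids, int dim) {
        Map<Integer, double[]> partial = new HashMap<>();
        for (double[] p : split) {
            int c = nearestCentroid(p, centroids);
            double[] acc = partial.computeIfAbsent(c, id -> new double[dim + 1]);
            for (int i = 0; i < dim; i++) acc[i] += p[i];
            acc[dim] += 1; // element count in the last slot
        }
        return partial;
    }

    // Reduce phase: merge the partials from all combiners and recompute
    // each centroid as the count-weighted average of the sums.
    static double[][] reduce(List<Map<Integer, double[]>> partials,
                             int k, int dim) {
        double[][] next = new double[k][dim];
        double[] counts = new double[k];
        for (Map<Integer, double[]> partial : partials) {
            for (Map.Entry<Integer, double[]> e : partial.entrySet()) {
                double[] acc = e.getValue();
                for (int i = 0; i < dim; i++) next[e.getKey()][i] += acc[i];
                counts[e.getKey()] += acc[dim];
            }
        }
        for (int c = 0; c < k; c++)
            if (counts[c] > 0)
                for (int i = 0; i < dim; i++) next[c][i] /= counts[c];
        return next;
    }

    public static void main(String[] args) {
        // Two "nodes", each holding a split of the data; k = 2, dim = 2.
        double[][] centroids = {{0, 0}, {10, 0}};
        List<double[]> split1 = Arrays.asList(new double[]{0, 0}, new double[]{0, 2});
        List<double[]> split2 = Arrays.asList(new double[]{10, 0}, new double[]{10, 2});
        List<Map<Integer, double[]>> partials = Arrays.asList(
            combine(split1, centroids, 2), combine(split2, centroids, 2));
        System.out.println(Arrays.deepToString(reduce(partials, 2, 2)));
        // prints [[0.0, 1.0], [10.0, 1.0]]
    }
}
```

Convergence would then be checked outside this per-iteration code, e.g. by comparing each new centroid against the old one and rerunning the job while any centroid moved more than some threshold.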