[ 
https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852732#action_12852732
 ] 

new user commented on MAHOUT-357:
---------------------------------

If so, could you elaborate on exactly what you want to achieve with a new 
implementation? That is, should I optimize this algorithm or think about an 
entirely new implementation approach? Please be a bit more descriptive.

> Implement a clustering algorithm on mapreduce
> ---------------------------------------------
>
>                 Key: MAHOUT-357
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-357
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: new user
>
> As I mentioned in my previous posts, I am interested in implementing a 
> clustering algorithm on MapReduce, so I will now describe what I have in 
> mind. Thinking about the k-means algorithm for clustering, it appears that 
> the whole data set would have to be copied to each node of the Hadoop 
> framework in every iteration of k-means. However, this can be done without 
> needless replication of data across the cluster. First, we select a set of 
> k elements as the initial cluster centers; this can be purely random or 
> decided on the basis of some criterion. We maintain a file that stores the 
> id of each cluster, the number of elements in it, and the exact position of 
> the cluster center in terms of its co-ordinates. This file has to be shared 
> by every node. During each iteration of the algorithm, the following steps 
> are performed:
> 1. Since each node holds a part of the input data, during the map phase it 
> calculates the distance of each of its elements from the k cluster 
> centroids. For each element, the smallest distance is chosen, and the id of 
> the nearest cluster together with the position of that element is emitted.
> 2. During the combine phase, for each cluster the average of the 
> co-ordinates of its elements is calculated, along with the number of 
> elements in that cluster. The combiner function thus outputs the cluster 
> id together with the average co-ordinates of the elements.
> 3. During the reduce phase, the cluster centroid is re-calculated using the 
> weighted averages of the co-ordinates.
> Thus, after these three steps, the new centroid of each cluster and the 
> number of elements in each cluster are updated.
> These three steps can be repeated, applying the map-combine-reduce phases 
> again, for as long as the convergence condition is not satisfied.
> I have proposed this as per my understanding of the problem and my 
> knowledge. If anybody has any doubts, or wants to add or suggest anything, 
> then please respond as soon as possible. And, if you consider it a good 
> idea, then please suggest how to proceed further in the GSoC process.
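For reference, the three map/combine/reduce steps described in the quoted proposal can be sketched roughly as follows. This is a minimal single-process illustration of one k-means iteration, not Hadoop or Mahout code; the function and variable names are made up for the example, and it assumes 2-D points with squared Euclidean distance:

```python
def kmeans_iteration(points, centroids):
    """One map/combine/reduce pass over 2-D points.
    points and centroids are lists of (x, y) tuples;
    returns the list of updated centroids."""
    # Map phase: assign each point to the id of its nearest centroid.
    assignments = []
    for p in points:
        dists = [((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2, i)
                 for i, c in enumerate(centroids)]
        _, cluster_id = min(dists)
        assignments.append((cluster_id, p))

    # Combine phase: per cluster, accumulate a partial
    # (coordinate sum, element count) record.
    partials = {}
    for cluster_id, (x, y) in assignments:
        sx, sy, n = partials.get(cluster_id, (0.0, 0.0, 0))
        partials[cluster_id] = (sx + x, sy + y, n + 1)

    # Reduce phase: the weighted average of the partial sums
    # gives the new centroid for each cluster.
    new_centroids = list(centroids)
    for cluster_id, (sx, sy, n) in partials.items():
        new_centroids[cluster_id] = (sx / n, sy / n)
    return new_centroids
```

In a real Hadoop job the combiner would run per node, emitting (sum, count) pairs, and the reducer would merge partials from all nodes before dividing; here a single combine pass stands in for both, since everything runs in one process.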

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.