Hello All, Sorry for the cross posting of this question.
I would like to implement the k-median clustering algorithm using map reduce. The k-median clustering algorithm mentioned here is very close to the k-means, except that the centroid of a cluster in the k-median algorithm is the median of the points belonging to this cluster. Please refer to the following link for a formal definition of the median in high dimensional feature spaces. http://en.wikipedia.org/wiki/Geometric_median Unlike in the k-means algorithm, where the centroids of clusters can be updated incrementally in combiners and reducers (since we are computing the average of those points belonging to the same cluster), in the k-median algorithm, it seems to me that we have to collect all the points falling into the same cluster before we can compute its centroid. This process of computing centroids requires lots of memory space and is computationally expensive ( O(n^2) complexity if n points in this cluster ). Could you please give me some advice on how to implement this k-median clustering algorithm more efficiently using map reduce? Thanks, Guohua