ps1 this assumes row-wise construction of A based on training set of m n-dimensional points. ps2 since we are doing multiple passes over A it may make sense to make sure it is committed to spark cache (by using checkpoint api), if spark is used
On Fri, Mar 31, 2017 at 10:53 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote: > here is the outline. For details of APIs, please refer to samsara manual > [2], i will not be be repeating it. > > Assume your training data input is m x n matrix A. For simplicity let's > assume it's a DRM with int row keys, i.e., DrmLike[Int]. > > Initialization: > > First, classic k-means starts by selecting initial clusters, by sampling > them out. You can do that by using sampling api [1], thus forming a k x n > in-memory matrix C (current centroids). C is therefore of Mahout's Matrix > type. > > You the proceed by alternating between cluster assignments and > recompupting centroid matrix C till convergence based on some test or > simply limited by epoch count budget, your choice. > > Cluster assignments: here, we go over current generation of A and > recompute centroid indexes for each row in A. Once we recompute index, we > put it into the row key . You can do that by assigning centroid indices to > keys of A using operator mapblock() (details in [2], [3], [4]). You also > need to broadcast C in order to be able to access it in efficient manner > inside mapblock() closure. Examples of that are plenty given in [2]. > Essentially, in mapblock, you'd reform the row keys to reflect cluster > index in C. while going over A, you'd have a "nearest neighbor" problem to > solve for the row of A and centroids C. This is the bulk of computation > really, and there are a few tricks there that can speed this step up in > both exact and approximate manner, but you can start with a naive search. > > Centroid recomputation: > once you assigned centroids to the keys of marix A, you'd want to do an > aggregating transpose of A to compute essentially average of row A grouped > by the centroid key. The trick is to do a computation of (1|A)' which will > results in a matrix of the shape (Counts/sums of cluster rows). This is the > part i find difficult to explain without a latex graphics. > > In Samsara, construction of (1|A)' corresponds to DRM expression > > (1 cbind A).t (again, see [2]). > > So when you compute, say, > > B = (1 | A)', > > then B is (n+1) x k, so each column contains a vector corresponding to a > cluster 1..k. In such column, the first element would be # of points in the > cluster, and the rest of it would correspond to sum of all points. So in > order to arrive to an updated matrix C, we need to collect B into memory, > and slice out counters (first row) from the rest of it. > > So, to compute C: > > C <- B (2:,:) each row divided by B(1,:) > > (watch out for empty clusters with 0 elements, this will cause lack of > convergence and NaNs in the newly computed C). > > This operation obviously uses subblocking and row-wise iteration over B, > for which i am again making reference to [2]. > > > [1] https://github.com/apache/mahout/blob/master/math-scala/ > src/main/scala/org/apache/mahout/math/drm/package.scala#L149 > > [2], Sasmara manual, a bit dated but viable, http://apache.github. > io/mahout/doc/ScalaSparkBindings.html > > [3] scaladoc, again, dated but largely viable for the purpose of this > exercise: > http://apache.github.io/mahout/0.10.1/docs/mahout-math-scala/index.htm > > [4] mapblock etc. http://apache.github.io/mahout/0.10.1/docs/mahout- > math-scala/index.html#org.apache.mahout.math.drm.RLikeDrmOps > > On Fri, Mar 31, 2017 at 9:54 AM, KHATWANI PARTH BHARAT < > h2016...@pilani.bits-pilani.ac.in> wrote: > >> @Dmitriycan you please again tell me the approach to move ahead. >> >> >> Thanks >> Parth Khatwani >> >> >> On Fri, Mar 31, 2017 at 10:15 PM, KHATWANI PARTH BHARAT < >> h2016...@pilani.bits-pilani.ac.in> wrote: >> >> > yes i am unable to figure out the way ahead. >> > Like how to create the augmented matrix A := (0|D) which you have >> > mentioned. >> > >> > >> > On Fri, Mar 31, 2017 at 10:10 PM, Dmitriy Lyubimov <dlie...@gmail.com> >> > wrote: >> > >> >> was my reply for your post on @user has been a bit confusing? >> >> >> >> On Fri, Mar 31, 2017 at 8:40 AM, KHATWANI PARTH BHARAT < >> >> h2016...@pilani.bits-pilani.ac.in> wrote: >> >> >> >> > Sir, >> >> > I am trying to write the kmeans clustering algorithm using Mahout >> >> Samsara >> >> > but i am bit confused >> >> > about how to leverage Distributed Row Matrix for the same. Can >> anybody >> >> help >> >> > me with same. >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > Thanks >> >> > Parth Khatwani >> >> > >> >> >> > >> > >> > >