> Let me see if I'm following you. In the display example, there are 1100, 2d vectors generated as raw data D which is 1100x2. Then, the preprocessing step uses a distance measure to produce A, which is 1100x1100. They are not really affinities, more like distances, so I may have missed the boat on that step. Since the distance measure is reducing the [2] dimensionality of the Di and Dj vectors with a scalar (aij), I don't see how to reconstruct D from A.

You don't necessarily need to be able to reconstruct D from A, so I suppose this is where the Fourier transform analogy breaks down. A is indexed by row and column according to the original data, so as long as you know the order in which the rows and columns of A were derived from D, you can transitively identify the points in D by index.
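
To make the indexing point concrete, here is a minimal sketch in plain Python (not Mahout code). It assumes a Gaussian kernel for the affinities, which is a choice made only for this illustration; the display example discussed here uses a distance measure. The function name `affinity_matrix`, the `sigma` parameter, and the four toy points are all hypothetical.

```python
import math

def affinity_matrix(points, sigma=1.0):
    """Build a symmetric affinity matrix from a list of 2-d points.

    The Gaussian kernel is an assumption made for this sketch; the
    display example in this thread uses a distance measure instead.
    """
    n = len(points)
    aff = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between points i and j.
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            aff[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return aff

# Hypothetical toy data standing in for the 1100x2 matrix D.
D = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
A = affinity_matrix(D)

# Row i and column i of A both refer to D[i], so an index found in A
# can be used directly to look up the original point in D.
nearest = max((j for j in range(len(D)) if j != 2), key=lambda j: A[2][j])
print(nearest, D[nearest])
```

The sketch never recovers D from A. It only relies on row i and column i of A still referring to D[i], which holds as long as nothing reorders the rows in between.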
> KMeans will cluster all the input vectors in an arbitrary order if on a N>1 cluster and so Di and Dj will lose their index positions in the result. If the D vectors are NamedVectors, with their index as the name, then this will flow through to the clustered points at the output. The order of those points won't bear much relation to the order of the input, but the names will be preserved. KMeans does not mess with the order of the elements within each D vector. I don't know if this is sufficient or if Lanczos does anything similar.

Like Ted mentioned, NamedVector may be the key here to identifying the original points from the clustered projected data. That's probably the right way to go.

> -----Original Message-----
> From: squinn.squ...@gmail.com [mailto:squinn.squ...@gmail.com] On Behalf Of Shannon Quinn
> Sent: Tuesday, May 24, 2011 2:10 PM
> To: dev@mahout.apache.org
> Subject: Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example fails
>
> You're right, that would give you the affinity matrix. However, the affinity matrix is an easier beast to tame since the matrix is constructed with all the points' orders preserved: aff[i][j] is the relationship between original_point[i] and original_point[j], so for all practical purposes I treat this as the "original data" (since it's easy to go back and forth between the two).
>
> Problem is, I'm not sure if the Lanczos solver or K-Means preserve this ordering of indices. Does the nth point with label y from the result of K-means correspond to the nth row of the column matrix of eigenvectors? If so, then does that nth row from the eigenvector matrix also correspond to the nth original data point (the one represented by proxy by row n and column n of the affinity matrix)? If both these conditions are true, then and only then can we say that original_point[n]'s cluster is y.
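
A rough sketch of the name-based mapping, in plain Python rather than Mahout: each projected row carries its original index as a name (the role a NamedVector would play), the rows are shuffled to imitate a multi-node run, and the labels are mapped back by name instead of by position. The function `cluster_1d`, the threshold rule standing in for K-means, and the toy values are all hypothetical.

```python
import random

def cluster_1d(named_points, threshold):
    """Stand-in for K-means: label each projected value by a threshold.

    Returns (name, label) pairs in whatever order the input arrived.
    """
    return [(name, 0 if value < threshold else 1) for name, value in named_points]

# Hypothetical original points and their 1-d projections into eigenspace.
original = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]
projected = [-0.7, 0.7, -0.69, 0.71]

# Attach each row index as a name, as a NamedVector would carry it.
named = list(enumerate(projected))

# Simulate a multi-node run that returns rows in arbitrary order.
rng = random.Random(42)
rng.shuffle(named)

assignments = cluster_1d(named, threshold=0.0)

# Map labels back by name rather than by position in the output.
labels = [None] * len(original)
for name, label in assignments:
    labels[name] = label

for point, label in zip(original, labels):
    print(point, label)
```

Because the lookup goes through the name, `labels` comes out as `[0, 1, 0, 1]` whatever order the shuffle produces, so neither of the two ordering conditions above has to hold. Whether the names survive the Lanczos step in Mahout is a separate question that this sketch does not answer.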
>
> On Tue, May 24, 2011 at 4:39 PM, Jeff Eastman <jeast...@narus.com> wrote:
>
> > Would that give you the original data matrix, the clustered data matrix, or the clustered affinity matrix? Even with the analogy in mind I'm having trouble connecting the dots. Seems like I lost the original data matrix in step 1 when I used a distance measure to produce A from it. If the returned eigenvectors define Q, then what is the significance of QAQ^-1? And, more importantly, if the Q eigenvectors define the clusters in eigenspace, what is the inverse transformation?
> >
> > -----Original Message-----
> > From: squinn.squ...@gmail.com [mailto:squinn.squ...@gmail.com] On Behalf Of Shannon Quinn
> > Sent: Tuesday, May 24, 2011 12:07 PM
> > To: dev@mahout.apache.org
> > Subject: Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example fails
> >
> > That's an excellent analogy! Employing that strategy, would it be possible (and not too expensive) to do the QAQ^-1 operation to get the original data matrix, after we've clustered the points in eigenspace?
> >
> > On Tue, May 24, 2011 at 2:59 PM, Jeff Eastman <jeast...@narus.com> wrote:
> >
> > > For the display example, it is not necessary to cluster the original points. The other clustering display examples only train the clusters and do not classify the points. They are drawn first and the cluster centers & radii are superimposed afterwards. Thus I think it is only necessary to back-transform the clusters.
> > >
> > > My EE gut tells me this is like Fourier transforms between time- and frequency-domains. If this is true then what we need is the inverse transform. Is this a correct analogy?
> > >
> > > -----Original Message-----
> > > From: squinn.squ...@gmail.com [mailto:squinn.squ...@gmail.com] On Behalf Of Shannon Quinn
> > > Sent: Tuesday, May 24, 2011 11:39 AM
> > > To: dev@mahout.apache.org
> > > Subject: Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example fails
> > >
> > > This is actually something I could use a little expert Hadoop assistance on. The general idea is that the points that are clustered in eigenspace have a 1-to-1 correspondence with the original points (which is how you get your cluster assignments), but this back-mapping after clustering isn't explicitly implemented yet, since that's the core of the IO issue.
> > >
> > > My block on this is my lack of understanding in how the actual ordering of the points change (or not?) from when they are projected into eigenspace (the Lanczos solver) and when K-means makes its cluster assignments. On a one-node setup the original ordering appears to be preserved through all the operations, so the labels of the original points can be assigned by giving original_point[i] the label of projected_point[i], hence the cluster assignments are easy to determine. For multi-node setups, however, I simply don't know if this heuristic holds.
> > >
> > > But I believe the immediate issue here is that we're feeding the projected points to the display, when it should be the original points *annotated* with the cluster assignments from the corresponding projected points. The question is how to shift those assignments over robustly; right now it's just a hack job in the SpectralKMeansDriver...or maybe (hopefully!) it's just the version I have locally :o)
> > >
> > > On Tue, May 24, 2011 at 2:13 PM, Jeff Eastman <jeast...@narus.com> wrote:
> > >
> > > > Yes, I expect it is pilot error on my part.
> > > > The original implementation was failing in this manner because I was requesting 5 eigenvectors (clusters). I changed it to 2 and now it displays something but it is not even close to correct. I think this is because I have not transformed back from eigen space to vector space. This all relates to the IO issue for the spectral clustering code which I don't grok.
> > > >
> > > > The display driver begins with the sample points and generates the affinity matrix using a distance measure. Not clear this is even a correct interpretation of that matrix. Then spectral kmeans runs and produces 2 clusters which I display directly. Seems like this number should be more like the k in kmeans, and 5 was more realistic given the data. I believe there is a missing output transformation to recover the clusters from the eigenvectors but I don't know how to do that.
> > > >
> > > > I bet you do :)
> > > >
> > > > -----Original Message-----
> > > > From: Shannon Quinn (JIRA) [mailto:j...@apache.org]
> > > > Sent: Tuesday, May 24, 2011 8:07 AM
> > > > To: dev@mahout.apache.org
> > > > Subject: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example fails
> > > >
> > > > [ https://issues.apache.org/jira/browse/MAHOUT-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038608#comment-13038608 ]
> > > >
> > > > Shannon Quinn commented on MAHOUT-524:
> > > > --------------------------------------
> > > >
> > > > +1, I'm on it.
> > > >
> > > > I'm a little unclear as to the context of the initial Hudson comment: the display method is expecting 2D vectors, but getting 5D ones?
> > > >
> > > > > DisplaySpectralKMeans example fails
> > > > > -----------------------------------
> > > > >
> > > > >                 Key: MAHOUT-524
> > > > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-524
> > > > >             Project: Mahout
> > > > >          Issue Type: Bug
> > > > >          Components: Clustering
> > > > >    Affects Versions: 0.4, 0.5
> > > > >            Reporter: Jeff Eastman
> > > > >            Assignee: Jeff Eastman
> > > > >              Labels: clustering, k-means, visualization
> > > > >             Fix For: 0.6
> > > > >
> > > > >         Attachments: aff.txt, raw.txt, spectralkmeans.png
> > > > >
> > > > >
> > > > > I've committed a new display example that attempts to push the standard mixture of models data set through spectral k-means. After some tweaking of configuration arguments and a bug fix in EigenCleanupJob it runs spectral k-means to completion. The display example is expecting 2-d clustered points and the example is producing 5-d points. Additional I/O work is needed before this will play with the rest of the clustering algorithms.
> > > >
> > > > --
> > > > This message is automatically generated by JIRA.
> > > > For more information on JIRA, see: http://www.atlassian.com/software/jira