Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example fails

Shannon Quinn Wed, 25 May 2011 06:23:25 -0700

> Let me see if I'm following you. In the display example, there are 1100, 2d
> vectors generated as raw data D which is 1100x2. Then, the preprocessing
> step uses a distance measure to produce A, which is 1100x1100. They are not
> really affinities, more like distances, so I may have missed the boat on
> that step. Since the distance measure is reducing the [2] dimensionality of
> the Di and Dj vectors with a scalar (aij), I don't see how to reconstruct D
> from A.
>
>
You don't necessarily need to be able to reconstruct D from A, so I suppose
this is where the Fourier transform analogy breaks down. A is indexed by row
and column according to the original data, so as long as you know the order
in which the rows and columns of A were derived from D, you can transitively
identify the points in D by index.
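
A minimal sketch of that index bookkeeping, in plain Python rather than
Mahout code. The toy points, the Gaussian kernel used to turn distances into
affinities, the sigma value, and the row labels are all assumptions made for
illustration; nothing here is taken from the display example itself.

```python
import math

def affinity_matrix(points, sigma=1.0):
    """Build a symmetric affinity matrix; row/column i always refers to points[i]."""
    n = len(points)
    aff = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between point i and point j.
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            # Assumed Gaussian kernel: small distance -> affinity near 1.
            aff[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return aff

# Hypothetical toy data standing in for D (the real example has 1100 points).
D = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
A = affinity_matrix(D)

# Hypothetical per-row labels, as if clustering had been run on rows that are
# still in the same order as the rows of A.
row_labels = [0, 0, 1, 1]

# Row i of A was derived from D[i], so a label for row i is a label for D[i].
labeled_points = [(D[i], row_labels[i]) for i in range(len(D))]
```

The only thing carried from A back to D is the index i, which is why D never
has to be reconstructed from A. This holds only as long as every later step
keeps the rows in the same order.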



> KMeans will cluster all the input vectors in an arbitrary order if on a N>1
> cluster and so Di and Dj will lose their index positions in the result. If
> the D vectors are NamedVectors, with their index as the name, then this will
> flow through to the clustered points at the output. The order of those
> points won't bear much relation to the order of the input, but the names
> will be preserved. KMeans does not mess with the order of the elements
> within each D vector. I don't know if this is sufficient or if Lanczos does
> anything similar.
>

As Ted mentioned, NamedVector may be the key to identifying the original
points from the clustered, projected data. That's probably the right way to
go.
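
A rough sketch of the name-carrying idea, again in plain Python and not
using Mahout's API. The (name, vector) tuples stand in for what a
NamedVector would provide, the shuffle stands in for a multi-node job
returning records in arbitrary order, and the fixed centers with a
nearest-center assignment are a simplified stand-in for K-means. All of
these are assumptions for illustration.

```python
import random

# Hypothetical projected points, each tagged with the index of the original
# point it came from. The tag plays the role of a vector name.
projected = [(0, (0.9, 0.1)), (1, (0.8, 0.2)), (2, (0.1, 0.9)), (3, (0.2, 0.8))]

def nearest_center(vec, centers):
    """Return the index of the center closest to vec (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centers]
    return dists.index(min(dists))

# Simulate a distributed job handing records back in arbitrary order.
shuffled = projected[:]
random.Random(42).shuffle(shuffled)

# Stand-in for the cluster assignment step, with fixed hypothetical centers.
centers = [(1.0, 0.0), (0.0, 1.0)]
assignments = {name: nearest_center(vec, centers) for name, vec in shuffled}

# Recover labels in original order by name, regardless of output order.
labels_in_original_order = [assignments[i] for i in range(len(projected))]
```

Because the lookup goes through the name and not the position, the result
does not depend on whether the solver or the clustering step preserves
ordering.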


>
> -----Original Message-----
> From: squinn.squ...@gmail.com [mailto:squinn.squ...@gmail.com] On Behalf
> Of Shannon Quinn
> Sent: Tuesday, May 24, 2011 2:10 PM
> To: dev@mahout.apache.org
> Subject: Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example
> fails
>
> You're right, that would give you the affinity matrix. However, the
> affinity matrix is an easier beast to tame since the matrix is constructed
> with all the points' orders preserved: aff[i][j] is the relationship between
> original_point[i] and original_point[j], so for all practical purposes I
> treat this as the "original data" (since it's easy to go back and forth
> between the two).
>
> Problem is, I'm not sure if the Lanczos solver or K-Means preserve this
> ordering of indices. Does the nth point with label y from the result of
> K-means correspond to the nth row of the column matrix of eigenvectors? If
> so, then does that nth row from the eigenvector matrix also correspond to
> the nth original data point (the one represented by proxy by row n and
> column n of the affinity matrix)? If both these conditions are true, then
> and only then can we say that original_point[n]'s cluster is y.
>
> On Tue, May 24, 2011 at 4:39 PM, Jeff Eastman <jeast...@narus.com> wrote:
>
> > Would that give you the original data matrix, the clustered data matrix,
> > or the clustered affinity matrix? Even with the analogy in mind I'm having
> > trouble connecting the dots. Seems like I lost the original data matrix
> > in step 1 when I used a distance measure to produce A from it. If the
> > returned eigenvectors define Q, then what is the significance of QAQ^-1?
> > And, more importantly, if the Q eigenvectors define the clusters in
> > eigenspace, what is the inverse transformation?
> >
> > -----Original Message-----
> > From: squinn.squ...@gmail.com [mailto:squinn.squ...@gmail.com] On Behalf
> > Of Shannon Quinn
> > Sent: Tuesday, May 24, 2011 12:07 PM
> > To: dev@mahout.apache.org
> > Subject: Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans
> > example fails
> >
> > That's an excellent analogy! Employing that strategy, would it be
> > possible (and not too expensive) to do the QAQ^-1 operation to get the
> > original data matrix, after we've clustered the points in eigenspace?
> >
> > On Tue, May 24, 2011 at 2:59 PM, Jeff Eastman <jeast...@narus.com> wrote:
> >
> > > For the display example, it is not necessary to cluster the original
> > > points. The other clustering display examples only train the clusters
> > > and do not classify the points. They are drawn first and the cluster
> > > centers & radii are superimposed afterwards. Thus I think it is only
> > > necessary to back-transform the clusters.
> > >
> > > My EE gut tells me this is like Fourier transforms between time- and
> > > frequency-domains. If this is true then what we need is the inverse
> > > transform. Is this a correct analogy?
> > >
> > > -----Original Message-----
> > > From: squinn.squ...@gmail.com [mailto:squinn.squ...@gmail.com] On Behalf
> > > Of Shannon Quinn
> > > Sent: Tuesday, May 24, 2011 11:39 AM
> > > To: dev@mahout.apache.org
> > > Subject: Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans
> > > example fails
> > >
> > > This is actually something I could use a little expert Hadoop
> > > assistance on. The general idea is that the points that are clustered
> > > in eigenspace have a 1-to-1 correspondence with the original points
> > > (which is how you get your cluster assignments), but this back-mapping
> > > after clustering isn't explicitly implemented yet, since that's the
> > > core of the IO issue.
> > >
> > > My block on this is my lack of understanding of how the actual
> > > ordering of the points changes (or not?) between when they are
> > > projected into eigenspace (the Lanczos solver) and when K-means makes
> > > its cluster assignments. On a one-node setup the original ordering
> > > appears to be preserved through all the operations, so the labels of
> > > the original points can be assigned by giving original_point[i] the
> > > label of projected_point[i], hence the cluster assignments are easy to
> > > determine. For multi-node setups, however, I simply don't know if this
> > > heuristic holds.
> > >
> > > But I believe the immediate issue here is that we're feeding the
> > > projected points to the display, when it should be the original points
> > > *annotated* with the cluster assignments from the corresponding
> > > projected points. The question is how to shift those assignments over
> > > robustly; right now it's just a hack job in the
> > > SpectralKMeansDriver...or maybe (hopefully!) it's just the version I
> > > have locally :o)
> > >
> > > On Tue, May 24, 2011 at 2:13 PM, Jeff Eastman <jeast...@narus.com> wrote:
> > >
> > > > Yes, I expect it is pilot error on my part. The original
> > > > implementation was failing in this manner because I was requesting 5
> > > > eigenvectors (clusters). I changed it to 2 and now it displays
> > > > something but it is not even close to correct. I think this is
> > > > because I have not transformed back from eigen space to vector space.
> > > > This all relates to the IO issue for the spectral clustering code
> > > > which I don't grok.
> > > >
> > > > The display driver begins with the sample points and generates the
> > > > affinity matrix using a distance measure. Not clear this is even a
> > > > correct interpretation of that matrix. Then spectral kmeans runs and
> > > > produces 2 clusters which I display directly. Seems like this number
> > > > should be more like the k in kmeans, and 5 was more realistic given
> > > > the data. I believe there is a missing output transformation to
> > > > recover the clusters from the eigenvectors but I don't know how to do
> > > > that.
> > > >
> > > > I bet you do :)
> > > >
> > > > -----Original Message-----
> > > > From: Shannon Quinn (JIRA) [mailto:j...@apache.org]
> > > > Sent: Tuesday, May 24, 2011 8:07 AM
> > > > To: dev@mahout.apache.org
> > > > Subject: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans
> > > > example fails
> > > >
> > > >
> > > >    [ https://issues.apache.org/jira/browse/MAHOUT-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038608#comment-13038608 ]
> > > >
> > > > Shannon Quinn commented on MAHOUT-524:
> > > > --------------------------------------
> > > >
> > > > +1, I'm on it.
> > > >
> > > > I'm a little unclear as to the context of the initial Hudson comment:
> > > > the display method is expecting 2D vectors, but getting 5D ones?
> > > >
> > > > > DisplaySpectralKMeans example fails
> > > > > -----------------------------------
> > > > >
> > > > >                 Key: MAHOUT-524
> > > > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-524
> > > > >             Project: Mahout
> > > > >          Issue Type: Bug
> > > > >          Components: Clustering
> > > > >    Affects Versions: 0.4, 0.5
> > > > >            Reporter: Jeff Eastman
> > > > >            Assignee: Jeff Eastman
> > > > >              Labels: clustering, k-means, visualization
> > > > >             Fix For: 0.6
> > > > >
> > > > >         Attachments: aff.txt, raw.txt, spectralkmeans.png
> > > > >
> > > > >
> > > > > I've committed a new display example that attempts to push the
> > > > > standard mixture of models data set through spectral k-means. After
> > > > > some tweaking of configuration arguments and a bug fix in
> > > > > EigenCleanupJob it runs spectral k-means to completion. The display
> > > > > example is expecting 2-d clustered points and the example is
> > > > > producing 5-d points. Additional I/O work is needed before this will
> > > > > play with the rest of the clustering algorithms.
> > > >
> > > > --
> > > > This message is automatically generated by JIRA.
> > > > For more information on JIRA, see: http://www.atlassian.com/software/jira
> > > >
> > >
> >
>
